APPROXIMATION OF ENTROPY BY NONPOLYNOMIAL FUNCTIONS

MATHEMATICAL PRELIMINARIES

5.6 APPROXIMATION OF ENTROPY BY NONPOLYNOMIAL FUNCTIONS

In the previous section, we introduced cumulant-based approximations of (neg)entropy. However, such cumulant-based methods sometimes provide a rather poor approximation of entropy. There are two main reasons for this. First, finite-sample estimators of higher-order cumulants are highly sensitive to outliers: their values may depend on only a few, possibly erroneous, observations with large values.

This means that outliers may completely determine the estimates of cumulants, thus making them useless. Second, even if the cumulants were estimated perfectly, they mainly measure the tails of the distribution, and are largely unaffected by structure near the center of the distribution. This is because expectations of polynomials like the fourth power are much more strongly affected by data far away from zero than by data close to zero.
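The sensitivity to outliers can be illustrated with a small numerical sketch. It is not part of the original text: it assumes NumPy, a sample of 1000 standard gaussian values, and a single corrupted observation set to 10, and the helper name `sample_kurtosis` is hypothetical. It compares how much one erroneous value changes a kurtosis estimate and an estimate of E{|x|}, a function that grows only linearly.

```python
import numpy as np

def sample_kurtosis(x):
    """Fourth-cumulant estimate E{x^4} - 3 (E{x^2})^2 after removing the sample mean."""
    x = x - x.mean()
    return np.mean(x**4) - 3.0 * np.mean(x**2) ** 2

rng = np.random.default_rng(0)
clean = rng.standard_normal(1000)
corrupted = clean.copy()
corrupted[0] = 10.0  # one erroneous observation (assumed value, for illustration)

kurt_clean = sample_kurtosis(clean)
kurt_corrupted = sample_kurtosis(corrupted)
abs_clean = np.mean(np.abs(clean))
abs_corrupted = np.mean(np.abs(corrupted))

print("kurtosis estimate:", kurt_clean, "->", kurt_corrupted)
print("E|x| estimate:    ", abs_clean, "->", abs_corrupted)
```

With these assumed numbers, the single corrupted value contributes about 10^4/1000 = 10 to the estimate of the fourth moment, whereas it can move the estimate of E{|x|} by at most 0.01.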

In this section, we introduce entropy approximations that are based on an approximative maximum entropy method. The motivation for this approach is that the entropy of a distribution cannot be determined from a given finite number of estimated expectations as in (5.21), even if these were estimated exactly. As explained in Section 5.3, there exist an infinite number of distributions for which the constraints in (5.21) are fulfilled, but whose entropies are very different from each other. In particular, the differential entropy reaches $-\infty$ in the limit where $x$ takes only a finite number of values.

A simple solution to this is the maximum entropy method. This means that we compute the maximum entropy that is compatible with our constraints or measurements in (5.21), which is a well-defined problem. This maximum entropy, or further approximations thereof, can then be used as a meaningful approximation of the entropy of a random variable. This is because in ICA we usually want to minimize entropy. The maximum entropy method gives an upper bound for entropy, and its minimization is likely to minimize the true entropy as well.

In this section, we first derive a first-order approximation of the maximum entropy density for a continuous one-dimensional random variable, given a number of simple constraints. This results in a density expansion that is somewhat similar to the classic polynomial density expansions by Gram-Charlier and Edgeworth. Using this approximation of density, an approximation of 1-D differential entropy is derived.

The approximation of entropy is both more exact and more robust against outliers than the approximations based on the polynomial density expansions, without being computationally more expensive.

5.6.1 Approximating the maximum entropy

Let us thus assume that we have observed (or, in practice, estimated) a number of expectations of $x$, of the form

$$\int p(\xi) F^i(\xi)\, d\xi = c_i, \quad \text{for } i = 1, \ldots, m \tag{5.36}$$

The functions $F^i$ are not, in general, polynomials. In fact, if we used simple polynomials, we would end up with something very similar to what we had in the preceding section.

Since in general the maximum entropy equations cannot be solved analytically, we make a simple approximation of the maximum entropy density $p_0$. This is based on the assumption that the density $p(\xi)$ is not very far from the gaussian density of the same mean and variance; this assumption is similar to the one made using polynomial density expansions.

As with the polynomial expansions, we can assume that $x$ has zero mean and unit variance. Therefore we put two additional constraints in (5.36), defined by

$$F^{n+1}(\xi) = \xi, \quad c_{n+1} = 0 \tag{5.37}$$

$$F^{n+2}(\xi) = \xi^2, \quad c_{n+2} = 1 \tag{5.38}$$

To further simplify the calculations, let us make another, purely technical assumption: The functions $F^i$, $i = 1, \ldots, n$, form an orthonormal system according to the metric defined by $\varphi$ in (5.27), and are orthogonal to all polynomials of second degree. In other words, for all $i, j = 1, \ldots, n$

$$\int \varphi(\xi) F^i(\xi) F^j(\xi)\, d\xi = \begin{cases} 1, & \text{if } i = j \\ 0, & \text{if } i \neq j \end{cases} \tag{5.39}$$

$$\int \varphi(\xi) F^i(\xi)\, \xi^k\, d\xi = 0, \quad \text{for } k = 0, 1, 2 \tag{5.40}$$

Again, these orthogonality constraints are very similar to those of Chebyshev-Hermite polynomials. For any set of linearly independent functions $F^i$ (not containing second-order polynomials), this assumption can always be made true by ordinary Gram-Schmidt orthonormalization.

Now, note that the assumption of near-gaussianity implies that all the other $a_i$ in (5.22) are very small compared to $a_{n+2} \approx -1/2$, since the exponential in (5.22) is not far from $\exp(-\xi^2/2)$. Thus we can make a first-order approximation of the exponential function (detailed derivations can be found in the Appendix). This allows for simple solutions for the constants in (5.22), and we obtain the approximative maximum entropy density, which we denote by $\hat{p}(\xi)$:

$$\hat{p}(\xi) = \varphi(\xi) \Bigl( 1 + \sum_{i=1}^{n} c_i F^i(\xi) \Bigr) \tag{5.41}$$

where $c_i = E\{F^i(\xi)\}$.

Now we can derive an approximation of differential entropy using this density approximation. As with the polynomial density expansions, we can use (5.31) and (5.32). After some algebraic manipulations (see the Appendix), we obtain

$$J(x) \approx \frac{1}{2} \sum_{i=1}^{n} E\{F^i(x)\}^2 \tag{5.42}$$

Note that even in cases where this approximation is not very accurate, (5.42) can be used to construct a measure of nongaussianity that is consistent in the sense that (5.42) obtains its minimum value, 0, when $x$ has a gaussian distribution. This is because according to (5.40) with $k = 0$, we have $E\{F^i(\nu)\} = 0$, where $\nu$ denotes a standardized gaussian variable.
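The construction in this subsection can be sketched numerically. The code below is only an illustration under stated assumptions, not the derivation of the Appendix: it uses NumPy, replaces the integrals by sums on a uniform grid over [-12, 12], takes the single function $G(\xi) = |\xi|$ as an example, and the names `gauss_inner`, `orthonormal_F`, and `negentropy_542` are hypothetical. Gram-Schmidt orthonormalization against $1, \xi, \xi^2$ yields one function $F$ satisfying (5.39) and (5.40), after which (5.42) is evaluated for the gaussian density and for a unit-variance Laplacian density.

```python
import numpy as np

# Uniform grid; the gaussian weight is negligible at the end points.
XI = np.linspace(-12.0, 12.0, 240001)
DX = XI[1] - XI[0]
PHI = np.exp(-XI**2 / 2) / np.sqrt(2 * np.pi)

def gauss_inner(f, g):
    """Approximate the integral of phi(xi) f(xi) g(xi) over xi by a grid sum."""
    return np.sum(PHI * f * g) * DX

def orthonormal_F(G):
    """Gram-Schmidt on [1, xi, xi^2, G]; returns the normalized residual of G."""
    basis = []
    for f in (np.ones_like(XI), XI, XI**2, G):
        r = f.astype(float)
        for b in basis:
            r = r - gauss_inner(r, b) * b
        basis.append(r / np.sqrt(gauss_inner(r, r)))
    return basis[-1]

def negentropy_542(p, Fs):
    """Evaluate (5.42), 0.5 * sum_i E{F^i(x)}^2, for a density p given on the grid."""
    return 0.5 * sum((np.sum(p * F) * DX) ** 2 for F in Fs)

F = orthonormal_F(np.abs(XI))

# Unit-variance Laplacian density, used here only as a nongaussian test case.
laplace = np.exp(-np.sqrt(2.0) * np.abs(XI)) / np.sqrt(2.0)

print("gaussian: ", negentropy_542(PHI, [F]))      # essentially zero
print("Laplacian:", negentropy_542(laplace, [F]))  # strictly positive
```

The gaussian case gives zero up to rounding because $F$ is orthogonal to the constant function, which is the consistency property noted above.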

5.6.2 Choosing the nonpolynomial functions

Now it remains to choose the “measuring” functions $F^i$ that define the information given in (5.36). As noted in Section 5.6.1, one can take practically any set of linearly independent functions, say $G^i$, $i = 1, \ldots, m$, and then apply Gram-Schmidt orthonormalization on the set containing those functions and the monomials $\xi^k$, $k = 0, 1, 2$, so as to obtain the set $F^i$ that fulfills the orthogonality assumptions in (5.39) and (5.40).

This can be done, in general, by numerical integration. In the practical choice of the functions $G^i$, the following criteria must be emphasized:

1. The practical estimation of $E\{G^i(x)\}$ should not be statistically difficult. In particular, this estimation should not be too sensitive to outliers.

2. The maximum entropy method assumes that the function $p_0$ in (5.22) is integrable. Therefore, to ensure that the maximum entropy distribution exists in the first place, the $G^i(x)$ must not grow faster than quadratically as a function of $|x|$, because a function growing faster might lead to the nonintegrability of $p_0$.

3. The $G^i$ must capture aspects of the distribution of $x$ that are pertinent in the computation of entropy. In particular, if the density $p(\xi)$ were known, the optimal function $G^{\mathrm{opt}}$ would clearly be $-\log p(\xi)$, because $-E\{\log p(x)\}$ gives the entropy directly. Thus, one might use for $G^i$ the log-densities of some known important densities.

The first two criteria are met if the $G^i(x)$ are functions that do not grow too fast (not faster than quadratically) as $|x|$ increases. This excludes, for example, the use of higher-order polynomials, which are used in the Gram-Charlier and Edgeworth expansions. One might then search, according to criterion 3, for log-densities of some well-known distributions that also fulfill the first two conditions. Examples will be given in the next subsection.

It should be noted, however, that the criteria above only delimit the space of functions that can be used. Our framework enables the use of very different functions (or just one) as $G^i$. However, if prior knowledge is available on the distributions whose entropy is to be estimated, criterion 3 shows how to choose the optimal function.

5.6.3 Simple special cases

A simple special case of (5.41) is obtained if one uses two functions $G^1$ and $G^2$, which are chosen so that $G^1$ is odd and $G^2$ is even. Such a system of two functions can measure the two most important features of nongaussian 1-D distributions. The odd function measures the asymmetry, and the even function measures the dimension of bimodality vs. peak at zero, closely related to sub- vs. supergaussianity. Classically, these features have been measured by skewness and kurtosis, which correspond to $G^1(x) = x^3$ and $G^2(x) = x^4$, but we do not use these functions for the reasons explained in Section 5.6.2. (In fact, with these choices, the approximation in (5.41) becomes identical to the one obtained from the Gram-Charlier expansion in (5.35).)

In this special case, the approximation in (5.42) simplifies to

$$J(x) \approx k_1 \bigl( E\{G^1(x)\} \bigr)^2 + k_2 \bigl( E\{G^2(x)\} - E\{G^2(\nu)\} \bigr)^2 \tag{5.43}$$

where $k_1$ and $k_2$ are positive constants (see the Appendix) and $\nu$ is a standardized gaussian variable. Practical examples of choices of $G^i$ that are consistent with the requirements in Section 5.6.2 are the following. First, for measuring bimodality/sparsity, one might use, according to the recommendations of Section 5.6.2, the log-density of the Laplacian distribution:

$$G^{2a}(x) = |x| \tag{5.44}$$

For computational reasons, a smoother version of $G^{2a}$ might also be used. Another choice would be the gaussian function, which can be considered as the log-density of a distribution with infinitely heavy tails (since it stays constant when going to infinity):

$$G^{2b}(x) = \exp(-x^2/2) \tag{5.45}$$

For measuring asymmetry, one might use, on more heuristic grounds, the following function:

$$G^1(x) = x \exp(-x^2/2) \tag{5.46}$$

which is smooth and robust against outliers.
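The gaussian reference values $E\{G^2(\nu)\}$ required by (5.43) can be checked numerically for the functions just introduced. The following sketch is an illustration only: it assumes NumPy and approximates the gaussian expectation by a sum on a uniform grid over [-12, 12]; the name `gaussian_expectation` is hypothetical.

```python
import numpy as np

xi = np.linspace(-12.0, 12.0, 240001)
dx = xi[1] - xi[0]
phi = np.exp(-xi**2 / 2) / np.sqrt(2 * np.pi)

def gaussian_expectation(values):
    """Approximate E{g(nu)} for standardized gaussian nu, with g given on the grid."""
    return np.sum(phi * values) * dx

e_abs = gaussian_expectation(np.abs(xi))                # |x|, as in (5.44)
e_gauss = gaussian_expectation(np.exp(-xi**2 / 2))      # exp(-x^2/2), as in (5.45)
e_odd = gaussian_expectation(xi * np.exp(-xi**2 / 2))   # x exp(-x^2/2), as in (5.46)

print(e_abs, np.sqrt(2 / np.pi))   # the two values should agree closely
print(e_gauss, np.sqrt(1 / 2))     # likewise
print(e_odd)                       # vanishes, since the function is odd
```

The odd function has zero gaussian expectation by symmetry, which is why no reference value is subtracted in the first term of (5.43).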

Using the preceding examples one obtains two practical examples of (5.43):

$$J_a(x) = k_1 \bigl( E\{x \exp(-x^2/2)\} \bigr)^2 + k_2^a \bigl( E\{|x|\} - \sqrt{2/\pi} \bigr)^2 \tag{5.47}$$

and

$$J_b(x) = k_1 \bigl( E\{x \exp(-x^2/2)\} \bigr)^2 + k_2^b \bigl( E\{\exp(-x^2/2)\} - \sqrt{1/2} \bigr)^2 \tag{5.48}$$

with $k_1 = 36/(8\sqrt{3} - 9)$, $k_2^a = 1/(2 - 6/\pi)$, and $k_2^b = 24/(16\sqrt{3} - 27)$. These approximations $J_a(x)$ and $J_b(x)$ can be considered more robust and accurate generalizations of the approximation derived using the Gram-Charlier expansion in Section 5.5.

Even simpler approximations of negentropy can be obtained by using only one nonquadratic function, which amounts to omitting one of the terms in the preceding approximations.
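As a minimal sketch of how (5.47) and (5.48) might be applied to data, the code below replaces the expectations by sample averages. It assumes NumPy and standardizes the sample to zero mean and unit variance first, in line with the assumption made in Section 5.6.1; the function names `standardize`, `negentropy_a`, and `negentropy_b`, and the sample sizes, are hypothetical choices for illustration.

```python
import numpy as np

# Constants as given with (5.47) and (5.48).
K1 = 36.0 / (8.0 * np.sqrt(3.0) - 9.0)
K2A = 1.0 / (2.0 - 6.0 / np.pi)
K2B = 24.0 / (16.0 * np.sqrt(3.0) - 27.0)

def standardize(x):
    """Rescale a sample to zero mean and unit variance."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def negentropy_a(x):
    """Sample version of (5.47), using |x| as the even function."""
    x = standardize(x)
    asym = np.mean(x * np.exp(-x**2 / 2))
    even = np.mean(np.abs(x)) - np.sqrt(2.0 / np.pi)
    return K1 * asym**2 + K2A * even**2

def negentropy_b(x):
    """Sample version of (5.48), using exp(-x^2/2) as the even function."""
    x = standardize(x)
    asym = np.mean(x * np.exp(-x**2 / 2))
    even = np.mean(np.exp(-x**2 / 2)) - np.sqrt(0.5)
    return K1 * asym**2 + K2B * even**2

rng = np.random.default_rng(0)
gauss = rng.standard_normal(100_000)
laplace = rng.laplace(size=100_000)

print("gaussian sample: ", negentropy_a(gauss), negentropy_b(gauss))
print("Laplacian sample:", negentropy_a(laplace), negentropy_b(laplace))
```

Both estimates stay close to zero for the gaussian sample and are clearly positive for the Laplacian one. Dropping the `asym` term or the `even` term gives the one-function variants mentioned above.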

5.6.4 Illustration

Here we illustrate the differences in accuracy of the different approximations of negentropy. The expectations were here evaluated exactly, ignoring finite-sample effects. Thus these results do not illustrate the robustness of the maximum entropy approximation with respect to outliers; this is quite evident anyway.

First, we used a family of gaussian mixture densities, defined by

$$p(x) = \mu \varphi(x) + (1 - \mu)\, 2 \varphi(2(x - 1)) \tag{5.49}$$

where $\mu$ is a parameter that takes all the values in the interval $0 \le \mu \le 1$. This family includes asymmetric densities of both negative and positive kurtosis. The results are depicted in Fig. 5.2. One can see that both of the approximations $J_a$ and $J_b$ introduced in Section 5.6.3 were considerably more accurate than the cumulant-based approximation in (5.35).
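A comparison of this kind can be sketched by numerical integration. The code below is a rough illustration and not the procedure used to produce Fig. 5.2: it assumes NumPy, a uniform grid over [-12, 12], and that the mixture variable is standardized to zero mean and unit variance before the approximations are applied; the name `mixture_negentropy` is hypothetical. It returns the true negentropy together with the approximations of (5.47) and (5.48) for a given mixing parameter.

```python
import numpy as np

K1 = 36.0 / (8.0 * np.sqrt(3.0) - 9.0)
K2A = 1.0 / (2.0 - 6.0 / np.pi)
K2B = 24.0 / (16.0 * np.sqrt(3.0) - 27.0)

X = np.linspace(-12.0, 12.0, 240001)
DX = X[1] - X[0]

def phi(x):
    """Standardized gaussian density."""
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def mixture_negentropy(mu):
    """Return (true J, J_a, J_b) for the mixture density in (5.49)."""
    p = mu * phi(X) + (1.0 - mu) * 2.0 * phi(2.0 * (X - 1.0))
    mean = np.sum(p * X) * DX
    std = np.sqrt(np.sum(p * (X - mean) ** 2) * DX)
    z = (X - mean) / std  # standardized variable, evaluated on the grid

    def expect(values):
        return np.sum(p * values) * DX

    # Entropy of the standardized variable is H(x) - log(std).
    entropy = -np.sum(p * np.log(p)) * DX - np.log(std)
    true_j = 0.5 * np.log(2 * np.pi * np.e) - entropy

    asym = expect(z * np.exp(-z**2 / 2))
    j_a = K1 * asym**2 + K2A * (expect(np.abs(z)) - np.sqrt(2 / np.pi)) ** 2
    j_b = K1 * asym**2 + K2B * (expect(np.exp(-z**2 / 2)) - np.sqrt(0.5)) ** 2
    return true_j, j_a, j_b

for mu in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(mu, mixture_negentropy(mu))
```

At both ends of the interval the density is gaussian, so all three quantities vanish up to numerical error; in between they can be compared directly.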

Second, we considered the exponential power family of density functions:

$$p_\alpha(\xi) = C_1 \exp(-C_2 |\xi|^\alpha) \tag{5.50}$$

Fig. 5.2 Comparison of different approximations of negentropy for the family of mixture densities in (5.49) parametrized by $\mu$ ranging from 0 to 1 (horizontal axis). Solid curve: true negentropy. Dotted curve: cumulant-based approximation as in (5.35). Dashed curve: approximation $J_a$ in (5.47). Dot-dashed curve: approximation $J_b$ in (5.48). The two maximum entropy approximations were clearly better than the cumulant-based one.

where $\alpha$ is a positive constant, and $C_1, C_2$ are normalization constants that make $p_\alpha$ a probability density of unit variance. For different values of $\alpha$, the densities in this family exhibit different shapes. For $\alpha < 2$, one obtains densities of positive kurtosis (supergaussian). For $\alpha = 2$, one obtains the gaussian density, and for $\alpha > 2$, a density of negative kurtosis. Thus the densities in this family can be used as examples of different symmetric nongaussian densities. In Fig. 5.3, the different negentropy approximations are plotted for this family, using parameter values $0.5 \le \alpha \le 3$. Since the densities used are all symmetric, the first terms in the approximations were neglected. Again, it is clear that both of the approximations $J_a$ and $J_b$ introduced in Section 5.6.3 were considerably more accurate than the cumulant-based approximation in (5.35). Especially in the case of supergaussian densities, the cumulant-based approximation performed very poorly; this is probably because it gives too much weight to the tails of the distribution.
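The normalization constants of the exponential power family are not written out in the text. As a sketch, they can be expressed through the gamma function by writing the density as $\exp(-(|\xi|/s)^\alpha)$ up to normalization and using the standard moment formulas for this form; the scale $s$ and the function names below are assumptions introduced only for this illustration.

```python
import math

def exp_power_constants(alpha):
    """Return (C1, C2) so that C1 * exp(-C2 * |xi|**alpha) has unit mass and unit variance."""
    g1 = math.gamma(1.0 / alpha)
    g3 = math.gamma(3.0 / alpha)
    scale = math.sqrt(g1 / g3)       # s such that the variance s^2 * g3 / g1 equals 1
    c2 = scale ** (-alpha)
    c1 = alpha / (2.0 * scale * g1)  # reciprocal of the integral of exp(-(|xi|/s)**alpha)
    return c1, c2

def exp_power_kurtosis(alpha):
    """Kurtosis E{x^4} - 3 of the unit-variance member with exponent alpha."""
    g1, g3, g5 = (math.gamma(k / alpha) for k in (1, 3, 5))
    return g5 * g1 / g3**2 - 3.0

for alpha in (0.5, 1.0, 2.0, 3.0):
    print(alpha, exp_power_constants(alpha), exp_power_kurtosis(alpha))
```

For $\alpha = 2$ this gives $C_1 = 1/\sqrt{2\pi}$ and $C_2 = 1/2$, the gaussian density, and the sign of the kurtosis changes at $\alpha = 2$ as described above.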
