MATHEMATICAL PRELIMINARIES
5.6 APPROXIMATION OF ENTROPY BY NONPOLYNOMIAL FUNCTIONS
In the previous section, we introduced cumulant-based approximations of (neg)entropy. However, such cumulant-based methods sometimes provide a rather poor approximation of entropy. There are two main reasons for this. First, finite-sample estimators of higher-order cumulants are highly sensitive to outliers: their values may depend on only a few, possibly erroneous, observations with large values.
This means that outliers may completely determine the estimates of cumulants, thus making them useless. Second, even if the cumulants were estimated perfectly, they mainly measure the tails of the distribution, and are largely unaffected by structure near the center of the distribution. This is because expectations of polynomials like the fourth power are much more strongly affected by data far away from zero than by data close to zero.
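The first of these points can be made concrete with a small numerical sketch. The sample size, the seed, and the outlier value of 10 are illustrative assumptions, not taken from the text. The sketch compares a fourth-order statistic (sample kurtosis) with a non-polynomial one (the sample mean of |x|) before and after a single outlier is appended.

```python
import numpy as np

def standardize(x):
    """Center and scale a sample to zero mean and unit variance."""
    return (x - x.mean()) / x.std()

def sample_kurtosis(x):
    """Fourth-cumulant estimate E{z^4} - 3 of the standardized sample."""
    z = standardize(x)
    return np.mean(z**4) - 3.0

def sample_abs_mean(x):
    """Estimate of E{|z|} for the standardized sample (a non-polynomial statistic)."""
    z = standardize(x)
    return np.mean(np.abs(z))

rng = np.random.default_rng(0)        # seed chosen arbitrarily
clean = rng.standard_normal(1000)     # gaussian sample, kurtosis near 0
dirty = np.append(clean, 10.0)        # the same sample plus one outlier

kurt_shift = sample_kurtosis(dirty) - sample_kurtosis(clean)
abs_shift = sample_abs_mean(dirty) - sample_abs_mean(clean)

print(f"change in kurtosis estimate: {kurt_shift:+.3f}")
print(f"change in E|x| estimate:     {abs_shift:+.3f}")
```

In a run of this kind, one observation out of 1001 moves the kurtosis estimate by several units, while the absolute-value statistic changes only by a few hundredths.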
In this section, we introduce entropy approximations that are based on an approximative maximum entropy method. The motivation for this approach is that the entropy of a distribution cannot be determined from a given finite number of estimated expectations as in (5.21), even if these were estimated exactly. As explained in Section 5.3, there exist an infinite number of distributions for which the constraints in (5.21) are fulfilled, but whose entropies are very different from each other. In particular, the differential entropy reaches $-\infty$ in the limit where $x$ takes only a finite number of values.
A simple solution to this is the maximum entropy method. This means that we compute the maximum entropy that is compatible with our constraints or measurements in (5.21), which is a well-defined problem. This maximum entropy, or further approximations thereof, can then be used as a meaningful approximation of the entropy of a random variable. This is because in ICA we usually want to minimize entropy. The maximum entropy method gives an upper bound for entropy, and its minimization is likely to minimize the true entropy as well.
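The statement that a few moment constraints leave the entropy unbounded from below can be checked numerically. The following is a minimal sketch under assumptions of our own: a symmetric two-component gaussian mixture is used as the example family, the constraints are zero mean and unit variance, and the entropy integral is approximated by a Riemann sum on a fine grid. None of these choices come from the text.

```python
import numpy as np

def mixture_density(xi, s):
    """Symmetric two-component gaussian mixture with zero mean and unit variance.

    The components sit at +/- a with standard deviation s, where a^2 + s^2 = 1.
    """
    a = np.sqrt(1.0 - s**2)
    norm = 1.0 / (s * np.sqrt(2.0 * np.pi))
    left = np.exp(-0.5 * ((xi + a) / s) ** 2)
    right = np.exp(-0.5 * ((xi - a) / s) ** 2)
    return 0.5 * norm * (left + right)

def differential_entropy(p, dx):
    """Riemann-sum approximation of -integral p log p, treating 0 log 0 as 0."""
    integrand = np.zeros_like(p)
    mask = p > 0
    integrand[mask] = p[mask] * np.log(p[mask])
    return -np.sum(integrand) * dx

xi = np.linspace(-8.0, 8.0, 800001)   # fine grid, spacing 2e-5
dx = xi[1] - xi[0]

gauss_entropy = 0.5 * np.log(2.0 * np.pi * np.e)   # entropy of N(0, 1)
widths = [0.5, 0.1, 0.02]
entropies = []
for s in widths:
    p = mixture_density(xi, s)
    mean = np.sum(xi * p) * dx
    var = np.sum(xi**2 * p) * dx - mean**2
    h = differential_entropy(p, dx)
    entropies.append(h)
    print(f"s={s:5.2f}  mean={mean:+.4f}  var={var:.4f}  entropy={h:+.4f}")
```

Every member of this family satisfies the same two constraints, yet the entropy keeps falling as the component width shrinks and the distribution approaches one that takes only the two values -1 and +1. The gaussian entropy, about 1.419, is the upper bound that the maximum entropy method would report for these constraints.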
In this section, we first derive a first-order approximation of the maximum entropy density for a continuous one-dimensional random variable, given a number of simple constraints. This results in a density expansion that is somewhat similar to the classic polynomial density expansions by Gram-Charlier and Edgeworth. Using this approximation of density, an approximation of 1-D differential entropy is derived.
The approximation of entropy is both more exact and more robust against outliers than the approximations based on the polynomial density expansions, without being computationally more expensive.
5.6.1 Approximating the maximum entropy
Let us thus assume that we have observed (or, in practice, estimated) a number of expectations of $x$, of the form
$$\int p(\xi) F_i(\xi)\, d\xi = c_i, \quad \text{for } i = 1, \ldots, m \tag{5.36}$$
The functions $F_i$ are not, in general, polynomials. In fact, if we used simple polynomials, we would end up with something very similar to what we had in the preceding section.
Since in general the maximum entropy equations cannot be solved analytically, we make a simple approximation of the maximum entropy density $p_0$. This is based on the assumption that the density $p(\xi)$ is not very far from the gaussian density of the same mean and variance; this assumption is similar to the one made using polynomial density expansions.
As with the polynomial expansions, we can assume that $x$ has zero mean and unit variance. Therefore we put two additional constraints in (5.36), defined by
$$F_{n+1}(\xi) = \xi, \quad c_{n+1} = 0 \tag{5.37}$$
$$F_{n+2}(\xi) = \xi^2, \quad c_{n+2} = 1 \tag{5.38}$$
To further simplify the calculations, let us make another, purely technical assumption:
The functions $F_i$, $i = 1, \ldots, n$, form an orthonormal system according to the metric defined by $\varphi$ in (5.27), and are orthogonal to all polynomials of second degree. In other words, for all $i, j = 1, \ldots, n$
$$\int \varphi(\xi) F_i(\xi) F_j(\xi)\, d\xi = \begin{cases} 1, & \text{if } i = j \\ 0, & \text{if } i \neq j \end{cases} \tag{5.39}$$
$$\int \varphi(\xi) F_i(\xi)\, \xi^k\, d\xi = 0, \quad \text{for } k = 0, 1, 2 \tag{5.40}$$
Again, these orthogonality constraints are very similar to those of Chebyshev-Hermite polynomials. For any set of linearly independent functions $F_i$ (not containing second-order polynomials), this assumption can always be made true by ordinary Gram-Schmidt orthonormalization.
Now, note that the assumption of near-gaussianity implies that all the other
$a_i$ in (5.22) are very small compared to $a_{n+2} \approx -1/2$, since the exponential in (5.22) is not far from $\exp(-\xi^2/2)$. Thus we can make a first-order approximation of the exponential function (detailed derivations can be found in the Appendix). This allows for simple solutions for the constants in (5.22), and we obtain the approximative maximum entropy density, which we denote by $\hat{p}(\xi)$:
$$\hat{p}(\xi) = \varphi(\xi) \Bigl( 1 + \sum_{i=1}^{n} c_i F_i(\xi) \Bigr) \tag{5.41}$$
where $c_i = E\{F_i(\xi)\}$.
Now we can derive an approximation of differential entropy using this density approximation. As with the polynomial density expansions, we can use (5.31) and (5.32). After some algebraic manipulations (see the Appendix), we obtain
$$J(x) \approx \frac{1}{2} \sum_{i=1}^{n} E\{F_i(x)\}^2 \tag{5.42}$$
Note that even in cases where this approximation is not very accurate, (5.42) can be used to construct a measure of nongaussianity that is consistent in the sense that (5.42) obtains its minimum value, 0, when $x$ has a gaussian distribution. This is because according to (5.40) with $k = 0$, we have $E\{F_i(\nu)\} = 0$ for a standardized gaussian variable $\nu$.
5.6.2 Choosing the nonpolynomial functions
Now it remains to choose the “measuring” functions $F_i$ that define the information given in (5.36). As noted in Section 5.6.1, one can take practically any set of linearly independent functions, say $G_i$, $i = 1, \ldots, m$, and then apply Gram-Schmidt orthonormalization on the set containing those functions and the monomials $\xi^k$, $k = 0, 1, 2$, so as to obtain the set $F_i$ that fulfills the orthogonality assumptions in (5.39) and (5.40). This can be done, in general, by numerical integration. In the practical choice of the functions $G_i$, the following criteria must be emphasized:
1. The practical estimation of $E\{G_i(x)\}$ should not be statistically difficult. In particular, this estimation should not be too sensitive to outliers.
2. The maximum entropy method assumes that the function $p_0$ in (5.22) is integrable. Therefore, to ensure that the maximum entropy distribution exists in the first place, the $G_i(x)$ must not grow faster than quadratically as a function of $|x|$, because a function growing faster might lead to the nonintegrability of $p_0$.
3. The $G_i$ must capture aspects of the distribution of $x$ that are pertinent in the computation of entropy. In particular, if the density $p(\xi)$ were known, the optimal function $G_{\mathrm{opt}}$ would clearly be $-\log p(\xi)$, because $-E\{\log p(x)\}$ gives the entropy directly. Thus, one might use for $G_i$ the log-densities of some known important densities.
The first two criteria are met if the $G_i(x)$ are functions that do not grow too fast (not faster than quadratically) as $|x|$ increases. This excludes, for example, the use of higher-order polynomials, which are used in the Gram-Charlier and Edgeworth expansions. One might then search, according to criterion 3, for log-densities of some well-known distributions that also fulfill the first two conditions. Examples will be given in the next subsection.
It should be noted, however, that the criteria above only delimit the space of functions that can be used. Our framework enables the use of very different functions (or just one) as $G_i$. However, if prior knowledge is available on the distributions whose entropy is to be estimated, criterion 3 shows how to choose the optimal function.
5.6.3 Simple special cases
A simple special case of (5.41) is obtained if one uses two functions $G_1$ and $G_2$, which are chosen so that $G_1$ is odd and $G_2$ is even. Such a system of two functions can measure the two most important features of nongaussian 1-D distributions. The odd function measures the asymmetry, and the even function measures the dimension of bimodality vs. peak at zero, closely related to sub- vs. supergaussianity. Classically, these features have been measured by skewness and kurtosis, which correspond to $G_1(x) = x^3$ and $G_2(x) = x^4$, but we do not use these functions for the reasons explained in Section 5.6.2. (In fact, with these choices, the approximation in (5.41) becomes identical to the one obtained from the Gram-Charlier expansion in (5.35).)
In this special case, the approximation in (5.42) simplifies to
$$J(x) \approx k_1 \bigl( E\{G_1(x)\} \bigr)^2 + k_2 \bigl( E\{G_2(x)\} - E\{G_2(\nu)\} \bigr)^2 \tag{5.43}$$
where $k_1$ and $k_2$ are positive constants (see the Appendix). Practical examples of choices of $G_i$ that are consistent with the requirements in Section 5.6.2 are the following. First, for measuring bimodality/sparsity, one might use, according to the recommendations of Section 5.6.2, the log-density of the Laplacian distribution:
$$G_{2a}(x) = |x| \tag{5.44}$$
For computational reasons, a smoother version of $G_{2a}$ might also be used. Another choice would be the gaussian function, which can be considered as the log-density of a distribution with infinitely heavy tails (since it stays constant when going to infinity):
$$G_{2b}(x) = \exp(-x^2/2) \tag{5.45}$$
For measuring asymmetry, one might use, on more heuristic grounds, the following function:
$$G_1(x) = x \exp(-x^2/2) \tag{5.46}$$
that is smooth and robust against outliers.
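The orthonormalization described in Section 5.6.2 can be sketched numerically for the three functions just introduced. The grid-based integration, the helper names `inner` and `orthonormalize`, and the choice of $1$, $\xi$, $(\xi^2 - 1)/\sqrt{2}$ as an orthonormal basis for the second-degree polynomials are assumptions made for this illustration, not part of the text.

```python
import numpy as np

# Integration grid; 0 is a grid point so the kink of |xi| is handled cleanly.
xi = np.linspace(-12.0, 12.0, 240001)
dx = xi[1] - xi[0]
phi = np.exp(-0.5 * xi**2) / np.sqrt(2.0 * np.pi)   # standardized gaussian density

def inner(f, g):
    """Inner product <f, g> = integral of phi * f * g, by a Riemann sum."""
    return np.sum(phi * f * g) * dx

# Orthonormal basis of the polynomials of degree <= 2 under this inner product.
poly_basis = [np.ones_like(xi), xi, (xi**2 - 1.0) / np.sqrt(2.0)]

def orthonormalize(g, basis):
    """One Gram-Schmidt step: remove projections on `basis`, then normalize.

    Returns the normalized function and the squared norm of the residual.
    """
    residual = g.copy()
    for e in basis:
        residual = residual - inner(g, e) * e
    delta = inner(residual, residual)
    return residual / np.sqrt(delta), delta

G1 = xi * np.exp(-0.5 * xi**2)    # (5.46), odd
G2a = np.abs(xi)                  # (5.44), even
G2b = np.exp(-0.5 * xi**2)        # (5.45), even

F1, delta1 = orthonormalize(G1, poly_basis)
F2a, delta2a = orthonormalize(G2a, poly_basis + [F1])
F2b, delta2b = orthonormalize(G2b, poly_basis + [F1])

for name, F in [("F1", F1), ("F2a", F2a), ("F2b", F2b)]:
    moments = [inner(F, xi**k) for k in range(3)]   # should vanish, cf. (5.40)
    print(name, "norm^2 =", round(inner(F, F), 6), "moments =", np.round(moments, 8))
print("squared residual norms:", delta1, delta2a, delta2b)
```

For a standardized $x$, the projections on $\xi$ and $\xi^2 - 1$ have zero expectation, so $E\{F_i(x)\} = (E\{G_i(x)\} - E\{G_i(\nu)\})/\sqrt{\delta_i}$, where $\delta_i$ is the squared residual norm. Read this way, the weights in (5.43) are $k_i = 1/(2\delta_i)$, which gives about 7.41, 11.09, and 33.67 for the three functions, in agreement with the constants quoted with (5.47) and (5.48).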
Using the preceding examples one obtains two practical examples of (5.43):
$$J_a(x) = k_1 \bigl( E\{x \exp(-x^2/2)\} \bigr)^2 + k_2^a \bigl( E\{|x|\} - \sqrt{2/\pi} \bigr)^2 \tag{5.47}$$
and
$$J_b(x) = k_1 \bigl( E\{x \exp(-x^2/2)\} \bigr)^2 + k_2^b \bigl( E\{\exp(-x^2/2)\} - \sqrt{1/2} \bigr)^2 \tag{5.48}$$
with $k_1 = 36/(8\sqrt{3} - 9)$, $k_2^a = 1/(2 - 6/\pi)$, and $k_2^b = 24/(16\sqrt{3} - 27)$. These approximations $J_a(x)$ and $J_b(x)$ can be considered more robust and accurate generalizations of the approximation derived using the Gram-Charlier expansion in Section 5.5.
Even simpler approximations of negentropy can be obtained by using only one nonquadratic function, which amounts to omitting one of the terms in the preceding approximations.
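As a rough sketch of how (5.47) and (5.48) might be applied to data, the expectations can be replaced by sample averages over a standardized sample. The function name `negentropy_approx`, the sample size, the seed, and the three test distributions are assumptions chosen for illustration.

```python
import numpy as np

# Constants quoted with (5.47) and (5.48).
K1 = 36.0 / (8.0 * np.sqrt(3.0) - 9.0)
K2A = 1.0 / (2.0 - 6.0 / np.pi)
K2B = 24.0 / (16.0 * np.sqrt(3.0) - 27.0)

def negentropy_approx(x):
    """Sample versions of J_a (5.47) and J_b (5.48).

    The sample is standardized first, since both formulas assume zero mean
    and unit variance; expectations are replaced by sample averages.
    """
    z = (x - np.mean(x)) / np.std(x)
    gauss = np.exp(-0.5 * z**2)
    asym = K1 * np.mean(z * gauss) ** 2
    j_a = asym + K2A * (np.mean(np.abs(z)) - np.sqrt(2.0 / np.pi)) ** 2
    j_b = asym + K2B * (np.mean(gauss) - np.sqrt(0.5)) ** 2
    return j_a, j_b

rng = np.random.default_rng(1)    # arbitrary seed
n = 100_000
samples = {
    "gaussian": rng.standard_normal(n),
    "laplacian": rng.laplace(scale=1.0 / np.sqrt(2.0), size=n),
    "uniform": rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=n),
}
results = {name: negentropy_approx(x) for name, x in samples.items()}
for name, (j_a, j_b) in results.items():
    print(f"{name:10s} J_a={j_a:.5f}  J_b={j_b:.5f}")
```

For the gaussian sample both values should be close to zero, as the consistency property noted after (5.42) requires, while the supergaussian (Laplacian) and subgaussian (uniform) samples give clearly positive values. Dropping the `asym` term corresponds to the one-function variant mentioned above.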
5.6.4 Illustration
Here we illustrate the differences in accuracy of the different approximations of negentropy. The expectations were here evaluated exactly, ignoring finite-sample effects. Thus these results do not illustrate the robustness of the maximum entropy approximation with respect to outliers; this is quite evident anyway.
First, we used a family of gaussian mixture densities, defined by
$$p(\xi) = \mu \varphi(\xi) + (1 - \mu)\, 2 \varphi(2(\xi - 1)) \tag{5.49}$$
where $\mu$ is a parameter that takes all the values in the interval $0 \leq \mu \leq 1$. This family includes asymmetric densities of both negative and positive kurtosis. The results are depicted in Fig. 5.2. One can see that both of the approximations $J_a$ and $J_b$ introduced in Section 5.6.3 were considerably more accurate than the cumulant-based approximation in (5.35).
Fig. 5.2 Comparison of different approximations of negentropy for the family of mixture densities in (5.49) parametrized by $\mu$ ranging from 0 to 1 (horizontal axis). Solid curve: true negentropy. Dotted curve: cumulant-based approximation as in (5.35). Dashed curve: approximation $J_a$ in (5.47). Dot-dashed curve: approximation $J_b$ in (5.48). The two maximum entropy approximations were clearly better than the cumulant-based one.
Second, we considered the exponential power family of density functions:
$$p(\xi) = C_1 \exp(-C_2 |\xi|^{\alpha}) \tag{5.50}$$
where $\alpha$ is a positive constant, and