MATHEMATICAL PRELIMINARIES
5.7 CONCLUDING REMARKS AND REFERENCES
Fig. 5.2 Comparison of different approximations of negentropy for the family of mixture densities in (5.49), parametrized by $\mu$ ranging from 0 to 1 (horizontal axis). Solid curve: true negentropy. Dotted curve: cumulant-based approximation as in (5.35). Dashed curve: approximation $J_a$ in (5.47). Dot-dashed curve: approximation $J_b$ in (5.48). The two maximum entropy approximations were clearly better than the cumulant-based one.

where
$\alpha$ is a positive constant, and $C_1, C_2$ are normalization constants that make $p_\alpha$ a probability density of unit variance. For different values of $\alpha$, the densities in this family exhibit different shapes. For $\alpha < 2$, one obtains densities of positive kurtosis (supergaussian). For $\alpha = 2$, one obtains the gaussian density, and for $\alpha > 2$, a density of negative kurtosis. Thus the densities in this family can be used as examples of different symmetric nongaussian densities. In Fig. 5.3, the different negentropy approximations are plotted for this family, using parameter values $0.5 \le \alpha \le 3$. Since the densities used are all symmetric, the first terms in the approximations were neglected. Again, it is clear that both of the approximations $J_a$ and $J_b$ introduced in Section 5.6.3 were considerably more accurate than the cumulant-based approximation in (5.35). Especially in the case of supergaussian densities, the cumulant-based approximation performed very poorly; this is probably because it gives too much weight to the tails of the distribution.
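The dependence of the shape on the exponent can be checked with a few lines of code. The sketch below is illustrative only: it assumes that the family has the exponential power form, with density proportional to exp(-C2 |xi|^alpha), which is not restated in this section. The function names are hypothetical. Under that assumption, both the excess kurtosis and the true negentropy of the unit-variance member have closed forms in terms of the gamma function.

```python
import math

def excess_kurtosis(alpha):
    """Excess kurtosis of a density proportional to exp(-C2 * |xi|**alpha).

    Assumes the exponential power form; the value does not depend on scale.
    """
    g1, g3, g5 = (math.gamma(k / alpha) for k in (1, 3, 5))
    return g5 * g1 / g3 ** 2 - 3.0

def negentropy(alpha):
    """Negentropy (in nats) of the unit-variance member of the assumed family."""
    g1, g3 = math.gamma(1.0 / alpha), math.gamma(3.0 / alpha)
    scale = math.sqrt(g1 / g3)  # scale s for which the variance equals 1
    # Entropy of the density alpha / (2 s Gamma(1/alpha)) * exp(-(|x|/s)**alpha)
    entropy = 1.0 / alpha + math.log(2.0 * scale * g1 / alpha)
    gaussian_entropy = 0.5 * math.log(2.0 * math.pi * math.e)
    return gaussian_entropy - entropy

for alpha in (0.5, 1.0, 2.0, 3.0):
    print(f"alpha={alpha}: excess kurtosis={excess_kurtosis(alpha):+.4f}, "
          f"negentropy={negentropy(alpha):.4f}")
```

With this form, the kurtosis is positive below alpha = 2, zero at alpha = 2, and negative above it, and the negentropy vanishes only in the gaussian case.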
Fig. 5.3 Comparison of different approximations of negentropy for the family of densities (5.50), parametrized by $\alpha$ (horizontal axis). On the left, approximations for densities of positive kurtosis ($0.5 \le \alpha < 2$) are depicted, and on the right, approximations for densities of negative kurtosis ($2 < \alpha \le 3$). Solid curve: true negentropy. Dotted curve: cumulant-based approximation as in (5.35). Dashed curve: approximation $J_a$ in (5.47). Dot-dashed curve: approximation $J_b$ in (5.48). Clearly, the maximum entropy approximations were much better than the cumulant-based one, especially in the case of densities of positive kurtosis.

Problems
5.1 Assume that the random variable $X$ can have two values, $a$ and $b$, as in Example 5.1. Compute the entropy as a function of the probability of obtaining $a$. Show that this is maximized when the probability is $1/2$.
5.2 Compute the entropy of $X$ in Example 5.3.
5.3 Assume $x$ has a Laplacian distribution of arbitrary variance with pdf
$$
p_x(\xi) = \frac{1}{\sqrt{2}\,\sigma} \exp\left(-\frac{\sqrt{2}}{\sigma}\,|\xi|\right) \qquad \text{(5.51)}
$$
Compute the differential entropy.
5.4 Prove (5.15).
5.5 Prove (5.25).
5.6 Show that the definition of mutual information using Kullback-Leibler divergence is equal to the one given by entropy.
5.7 Compute the three first Chebyshev-Hermite polynomials.
5.8 Prove (5.34). Use the orthogonality in (5.29), and in particular the fact that $H_3$ and $H_4$ are orthogonal to any second-order polynomial (prove this first!). Furthermore, use the fact that any expression involving a third-order monomial of the higher-order cumulants is infinitely smaller than terms involving only second-order monomials (due to the assumption that the pdf is very close to gaussian).
Computer assignments
5.1 Consider random variables with (1) a uniform distribution and (2) a Laplacian distribution, both with zero mean and unit variance. Compute their differential entropies with numerical integration. Then compute the polynomial and nonpolynomial approximations given in this chapter. Compare the results.
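As a starting point for the numerical integration step, the following sketch estimates the two differential entropies with a simple midpoint rule. It is one possible setup, not a prescribed solution: the function names, the integration limits for the Laplacian, and the number of grid cells are assumptions made here. The approximations themselves are not computed, since their formulas lie outside this section.

```python
import math

def differential_entropy(pdf, lo, hi, n=200_000):
    """Midpoint-rule estimate of -integral p(x) log p(x) dx over [lo, hi], in nats."""
    h = (hi - lo) / n
    total = 0.0
    for k in range(n):
        p = pdf(lo + (k + 0.5) * h)
        if p > 0.0:  # the term 0 * log 0 is taken as 0
            total -= p * math.log(p)
    return total * h

def uniform_pdf(x):
    """Zero-mean, unit-variance uniform density on [-sqrt(3), sqrt(3)]."""
    half_width = math.sqrt(3.0)
    return 1.0 / (2.0 * half_width) if abs(x) <= half_width else 0.0

def laplacian_pdf(x):
    """Zero-mean, unit-variance Laplacian density."""
    return math.exp(-math.sqrt(2.0) * abs(x)) / math.sqrt(2.0)

h_uniform = differential_entropy(uniform_pdf, -math.sqrt(3.0), math.sqrt(3.0))
# The Laplacian tails are truncated at +-40, where the density is negligible.
h_laplace = differential_entropy(laplacian_pdf, -40.0, 40.0)
h_gauss = 0.5 * math.log(2.0 * math.pi * math.e)

print(f"uniform:   H = {h_uniform:.6f}, negentropy = {h_gauss - h_uniform:.6f}")
print(f"Laplacian: H = {h_laplace:.6f}, negentropy = {h_gauss - h_laplace:.6f}")
```

Both results can be compared with the closed forms log(2 sqrt(3)) for the uniform case and 1 + (1/2) log 2 for the Laplacian case.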
Appendix proofs
First, we give a detailed proof of (5.13). We have by (5.10)
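$$
\begin{aligned}
H(\mathbf{y}) &= -\int p_{\mathbf{y}}(\boldsymbol{\eta}) \log p_{\mathbf{y}}(\boldsymbol{\eta})\, d\boldsymbol{\eta} \\
&= -\int p_{\mathbf{x}}(\mathbf{f}^{-1}(\boldsymbol{\eta}))\, |\det J\mathbf{f}(\mathbf{f}^{-1}(\boldsymbol{\eta}))|^{-1} \log\big[ p_{\mathbf{x}}(\mathbf{f}^{-1}(\boldsymbol{\eta}))\, |\det J\mathbf{f}(\mathbf{f}^{-1}(\boldsymbol{\eta}))|^{-1} \big]\, d\boldsymbol{\eta} \\
&= -\int p_{\mathbf{x}}(\mathbf{f}^{-1}(\boldsymbol{\eta})) \log\big[ p_{\mathbf{x}}(\mathbf{f}^{-1}(\boldsymbol{\eta})) \big]\, |\det J\mathbf{f}(\mathbf{f}^{-1}(\boldsymbol{\eta}))|^{-1}\, d\boldsymbol{\eta} \\
&\quad - \int p_{\mathbf{x}}(\mathbf{f}^{-1}(\boldsymbol{\eta})) \log\big[ |\det J\mathbf{f}(\mathbf{f}^{-1}(\boldsymbol{\eta}))|^{-1} \big]\, |\det J\mathbf{f}(\mathbf{f}^{-1}(\boldsymbol{\eta}))|^{-1}\, d\boldsymbol{\eta}
\end{aligned}
\qquad \text{(A.1)}
$$
Now, let us make the change of integration variable
$$
\boldsymbol{\xi} = \mathbf{f}^{-1}(\boldsymbol{\eta}) \qquad \text{(A.2)}
$$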
which gives us
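$$
\begin{aligned}
H(\mathbf{y}) &= -\int p_{\mathbf{x}}(\boldsymbol{\xi}) \log\big[ p_{\mathbf{x}}(\boldsymbol{\xi}) \big]\, |\det J\mathbf{f}(\boldsymbol{\xi})|^{-1}\, |\det J\mathbf{f}(\boldsymbol{\xi})|\, d\boldsymbol{\xi} \\
&\quad - \int p_{\mathbf{x}}(\boldsymbol{\xi}) \log\big[ |\det J\mathbf{f}(\boldsymbol{\xi})|^{-1} \big]\, |\det J\mathbf{f}(\boldsymbol{\xi})|^{-1}\, |\det J\mathbf{f}(\boldsymbol{\xi})|\, d\boldsymbol{\xi}
\end{aligned}
\qquad \text{(A.3)}
$$
where the Jacobians cancel each other, and we have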
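$$
H(\mathbf{y}) = -\int p_{\mathbf{x}}(\boldsymbol{\xi}) \log\big[ p_{\mathbf{x}}(\boldsymbol{\xi}) \big]\, d\boldsymbol{\xi} + \int p_{\mathbf{x}}(\boldsymbol{\xi}) \log |\det J\mathbf{f}(\boldsymbol{\xi})|\, d\boldsymbol{\xi} \qquad \text{(A.4)}
$$
which gives (5.13).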
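The final identity says that the entropy of the transformed variable equals the entropy of the original variable plus the expected log of the absolute Jacobian determinant. A quick sanity check is possible in the special case of a linear map applied to a gaussian vector, where both entropies are known in closed form and the Jacobian is constant. The matrix, the covariance, and the function names below are arbitrary choices made for illustration.

```python
import math

def gaussian_entropy_2d(cov):
    """Differential entropy (nats) of a 2-D gaussian with covariance matrix `cov`."""
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    return 0.5 * math.log((2.0 * math.pi * math.e) ** 2 * det)

def transform_cov(A, cov):
    """Covariance of y = A x, that is A cov A^T, for 2x2 nested lists."""
    AC = [[sum(A[i][k] * cov[k][j] for k in range(2)) for j in range(2)]
          for i in range(2)]
    return [[sum(AC[i][k] * A[j][k] for k in range(2)) for j in range(2)]
            for i in range(2)]

A = [[2.0, 1.0], [0.5, 3.0]]       # Jacobian of the linear map f(x) = A x
cov_x = [[1.0, 0.3], [0.3, 2.0]]   # covariance of the gaussian vector x
det_A = A[0][0] * A[1][1] - A[0][1] * A[1][0]

lhs = gaussian_entropy_2d(transform_cov(A, cov_x))       # H(y)
rhs = gaussian_entropy_2d(cov_x) + math.log(abs(det_A))  # H(x) + log|det A|
print(lhs, rhs)
```

The two printed values should agree up to rounding error, since for a linear map the Jacobian term reduces to log |det A|.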
Now follow the proofs connected with the entropy approximations. First, we prove (5.41).
Due to the assumption of near-gaussianity, we can write $p_0(\xi)$ as
$$
p_0(\xi) = A \exp\Big( -\xi^2/2 + a_{n+1}\xi + (a_{n+2} + 1/2)\xi^2 + \sum_{i=1}^{n} a_i G_i(\xi) \Big), \qquad \text{(A.5)}
$$
where in the exponential, all other terms are very small with respect to the first one. Thus, using the first-order approximation $\exp(\epsilon) \approx 1 + \epsilon$, we obtain
$$
p_0(\xi) \approx \tilde{A} \varphi(\xi) \Big( 1 + a_{n+1}\xi + (a_{n+2} + 1/2)\xi^2 + \sum_{i=1}^{n} a_i G_i(\xi) \Big), \qquad \text{(A.6)}
$$
where $\varphi(\xi) = (2\pi)^{-1/2} \exp(-\xi^2/2)$ is the standardized gaussian density, and $\tilde{A} = \sqrt{2\pi} A$.
Due to the orthogonality constraints in (5.39), the equations for solving $\tilde{A}$ and $a_i$ become linear and almost diagonal:
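$$
\int p_0(\xi)\, d\xi = \tilde{A}\big(1 + (a_{n+2} + 1/2)\big) = 1 \qquad \text{(A.7)}
$$
$$
\int p_0(\xi)\, \xi\, d\xi = \tilde{A} a_{n+1} = 0 \qquad \text{(A.8)}
$$
$$
\int p_0(\xi)\, \xi^2\, d\xi = \tilde{A}\big(1 + 3(a_{n+2} + 1/2)\big) = 1 \qquad \text{(A.9)}
$$
$$
\int p_0(\xi)\, G_i(\xi)\, d\xi = \tilde{A} a_i = c_i, \quad \text{for } i = 1, \ldots, n \qquad \text{(A.10)}
$$
and can be easily solved to yield $\tilde{A} = 1$, $a_{n+1} = 0$, $a_{n+2} = -1/2$ and $a_i = c_i$, $i = 1, \ldots, n$. This gives (5.41).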
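The elimination behind this solution is short enough to write out. The following is a worked version of the step, using only (A.7)-(A.10) and the observation that $\tilde{A}$ cannot be zero for a nonvanishing density. Subtracting (A.7) from (A.9) gives
$$
2\tilde{A}\,(a_{n+2} + 1/2) = 0 \quad\Longrightarrow\quad a_{n+2} = -1/2 ,
$$
and substituting this back into (A.7) leaves $\tilde{A} = 1$. With $\tilde{A} = 1$, (A.8) reduces to $a_{n+1} = 0$ and (A.10) to $a_i = c_i$.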
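Second, we prove (5.42). Using the Taylor expansion $(1+\epsilon)\log(1+\epsilon) = \epsilon + \epsilon^2/2 + o(\epsilon^2)$, one obtains
$$
\begin{aligned}
&-\int \hat{p}(\xi) \log \hat{p}(\xi)\, d\xi && \text{(A.11)} \\
&= -\int \varphi(\xi) \Big(1 + \sum c_i G_i(\xi)\Big) \Big( \log\Big(1 + \sum c_i G_i(\xi)\Big) + \log \varphi(\xi) \Big)\, d\xi && \text{(A.12)} \\
&= -\int \varphi(\xi) \log \varphi(\xi)\, d\xi - \int \varphi(\xi) \sum c_i G_i(\xi) \log \varphi(\xi)\, d\xi && \text{(A.13)} \\
&\quad - \int \varphi(\xi) \Big[ \sum c_i G_i(\xi) + \frac{1}{2} \Big(\sum c_i G_i(\xi)\Big)^2 + o\Big(\Big(\sum c_i G_i(\xi)\Big)^2\Big) \Big]\, d\xi && \text{(A.14)} \\
&= H(\nu) - 0 - 0 - \frac{1}{2} \sum c_i^2 + o\Big(\Big(\sum c_i\Big)^2\Big) && \text{(A.15)}
\end{aligned}
$$
due to the orthogonality relationships in (5.39).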
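The Taylor expansion used at the start of this proof follows from the standard series $\log(1+\epsilon) = \epsilon - \epsilon^2/2 + o(\epsilon^2)$. As a brief check:
$$
(1+\epsilon)\log(1+\epsilon) = (1+\epsilon)\big(\epsilon - \epsilon^2/2 + o(\epsilon^2)\big) = \epsilon + \epsilon^2 - \epsilon^2/2 + o(\epsilon^2) = \epsilon + \epsilon^2/2 + o(\epsilon^2).
$$
In the proof it is applied with $\epsilon = \sum c_i G_i(\xi)$. Assuming, as the text indicates, that (5.39) makes the $G_i$ orthonormal under $\varphi$ and orthogonal to all polynomials of degree at most two, the two zeros in (A.15) arise because $\log \varphi(\xi)$ is a second-degree polynomial and because $\int \varphi(\xi) G_i(\xi)\, d\xi = 0$, while the quadratic term integrates to $\sum c_i^2$.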
Finally, we prove (5.43), (5.47) and (5.48). First, we must orthonormalize the two functions $G_1$ and $G_2$ according to (5.39). To do this, it is enough to determine constants $\delta_1, \beta_1, \delta_2, \alpha_2, \beta_2$ so that the functions $F_1(x) = (G_1(x) + \beta_1 x)/\delta_1$ and $F_2(x) = (G_2(x) + \alpha_2 x^2 + \beta_2)/\delta_2$ are orthogonal to any second degree polynomials as in (5.39), and have unit norm in the metric defined by $\varphi$. In fact, as will be seen below, this modification gives a $G_1$ that is odd and a $G_2$ that is even, and therefore the $G_i$ are automatically orthogonal with respect to each other.
Thus, first we solve the following equations:
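$$
\int \varphi(\xi)\, \xi\, \big(G_1(\xi) + \beta_1 \xi\big)\, d\xi = 0 \qquad \text{(A.16)}
$$
$$
\int \varphi(\xi)\, \xi^k \big(G_2(\xi) + \alpha_2 \xi^2 + \beta_2\big)\, d\xi = 0, \quad \text{for } k = 0, 2 \qquad \text{(A.17)}
$$
A straightforward solution gives:
$$
\beta_1 = -\int \varphi(\xi)\, G_1(\xi)\, \xi\, d\xi \qquad \text{(A.18)}
$$
$$
\alpha_2 = \frac{1}{2} \Big( \int \varphi(\xi)\, G_2(\xi)\, d\xi - \int \varphi(\xi)\, G_2(\xi)\, \xi^2\, d\xi \Big) \qquad \text{(A.19)}
$$
$$
\beta_2 = \frac{1}{2} \Big( \int \varphi(\xi)\, G_2(\xi)\, \xi^2\, d\xi - 3 \int \varphi(\xi)\, G_2(\xi)\, d\xi \Big) \qquad \text{(A.20)}
$$
Next note that, together with the standardization of $x$, the condition $\int \varphi(\xi) \big(G_2(\xi) + \alpha_2 \xi^2 + \beta_2\big)\, d\xi = 0$ implies
$$
c_i = E\{F_i(x)\} = \big[ E\{G_i(x)\} - E\{G_i(\nu)\} \big] / \delta_i \qquad \text{(A.21)}
$$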
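This implies (5.43), with $k_i^2 = 1/(2\delta_i^2)$. Thus we only need to determine explicitly the $\delta_i$ for each function. We solve the two equations
$$
\int \varphi(\xi)\, \big(G_1(\xi) + \beta_1 \xi\big)^2 / \delta_1^2\, d\xi = 1 \qquad \text{(A.22)}
$$
$$
\int \varphi(\xi)\, \big(G_2(\xi) + \alpha_2 \xi^2 + \beta_2\big)^2 / \delta_2^2\, d\xi = 1 \qquad \text{(A.23)}
$$
which, after some tedious manipulations, yield:
$$
\delta_1^2 = \int \varphi(\xi)\, G_1(\xi)^2\, d\xi - \Big( \int \varphi(\xi)\, G_1(\xi)\, \xi\, d\xi \Big)^2 \qquad \text{(A.24)}
$$
$$
\delta_2^2 = \int \varphi(\xi)\, G_2(\xi)^2\, d\xi - \Big( \int \varphi(\xi)\, G_2(\xi)\, d\xi \Big)^2 - \frac{1}{2} \Big( \int \varphi(\xi)\, G_2(\xi)\, d\xi - \int \varphi(\xi)\, G_2(\xi)\, \xi^2\, d\xi \Big)^2 . \qquad \text{(A.25)}
$$
Evaluating the $\delta_i$ for the given functions $G_i$, one obtains (5.47) and (5.48) by the relation $k_i^2 = 1/(2\delta_i^2)$.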
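The integrals in (A.24) and (A.25) are easy to evaluate numerically, which gives a way to check the "tedious manipulations" for concrete functions. The sketch below is illustrative only. It assumes the example functions G1(x) = x exp(-x^2/2), G2(x) = |x| and G2(x) = exp(-x^2/2); these are assumptions made here, since this section does not restate which functions are used in (5.47) and (5.48). All function names are hypothetical, and the integration uses a plain trapezoidal rule on a truncated interval.

```python
import math

def phi(x):
    """Standardized gaussian density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def gauss_integral(g, lo=-12.0, hi=12.0, n=48_000):
    """Trapezoidal estimate of the integral of phi(x) * g(x) over [lo, hi]."""
    h = (hi - lo) / n
    total = 0.5 * (phi(lo) * g(lo) + phi(hi) * g(hi))
    for k in range(1, n):
        x = lo + k * h
        total += phi(x) * g(x)
    return total * h

def delta1_squared(G1):
    """Right-hand side of (A.24)."""
    return (gauss_integral(lambda x: G1(x) ** 2)
            - gauss_integral(lambda x: G1(x) * x) ** 2)

def delta2_squared(G2):
    """Right-hand side of (A.25)."""
    m0 = gauss_integral(G2)                        # integral of phi * G2
    m2 = gauss_integral(lambda x: G2(x) * x * x)   # integral of phi * G2 * xi^2
    return gauss_integral(lambda x: G2(x) ** 2) - m0 ** 2 - 0.5 * (m0 - m2) ** 2

# Assumed example functions (not restated in this section).
def G1(x):
    return x * math.exp(-0.5 * x * x)

def G2a(x):
    return abs(x)

def G2b(x):
    return math.exp(-0.5 * x * x)

inv_2d1 = 1.0 / (2.0 * delta1_squared(G1))
inv_2d2a = 1.0 / (2.0 * delta2_squared(G2a))
inv_2d2b = 1.0 / (2.0 * delta2_squared(G2b))
print(inv_2d1, inv_2d2a, inv_2d2b)
```

For these assumed functions the integrals can also be done by hand, giving 1/(2 delta^2) equal to 36/(8 sqrt(3) - 9), 1/(2 - 6/pi) and 24/(16 sqrt(3) - 27), respectively, so the printed values can be compared against those expressions.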
6 Principal Component Analysis and Whitening
Principal component analysis (PCA) and the closely related Karhunen-Loève transform, or the Hotelling transform, are classic techniques in statistical data analysis, feature extraction, and data compression, stemming from the early work of Pearson [364]. Given a set of multivariate measurements, the purpose is to find a smaller set of variables with less redundancy that would give as good a representation as possible.
This goal is related to the goal of independent component analysis (ICA). However, in PCA the redundancy is measured by correlations between data elements, while in ICA the much richer concept of independence is used, and in ICA the reduction of the number of variables is given less emphasis. Using only the correlations as in PCA has the advantage that the analysis can be based on second-order statistics only.
In connection with ICA, PCA is a useful preprocessing step.
The basic PCA problem is outlined in this chapter. Both the closed-form solution and on-line learning algorithms for PCA are reviewed. Next, the related linear statistical technique of factor analysis is discussed. The chapter is concluded by presenting how data can be preprocessed by whitening, removing the effect of first- and second-order statistics, which is very helpful as the first step in ICA.