
5.7 CONCLUDING REMARKS AND REFERENCES

Fig. 5.2 Comparison of different approximations of negentropy for the family of mixture densities in (5.49) parametrized by $\mu$ ranging from 0 to 1 (horizontal axis). Solid curve: true negentropy. Dotted curve: cumulant-based approximation as in (5.35). Dashed curve: approximation $J_a$ in (5.47). Dot-dashed curve: approximation $J_b$ in (5.48). The two maximum entropy approximations were clearly better than the cumulant-based one.

where $\alpha$ is a positive constant, and $C_1, C_2$ are normalization constants that make $p_\alpha$ a probability density of unit variance. For different values of $\alpha$, the densities in this family exhibit different shapes. For $\alpha < 2$, one obtains densities of positive kurtosis (supergaussian). For $\alpha = 2$, one obtains the gaussian density, and for $\alpha > 2$, a density of negative kurtosis. Thus the densities in this family can be used as examples of different symmetric nongaussian densities. In Fig. 5.3, the different negentropy approximations are plotted for this family, using parameter values $0.5 \le \alpha \le 3$. Since the densities used are all symmetric, the first terms in the approximations were neglected. Again, it is clear that both of the approximations $J_a$ and $J_b$ introduced in Section 5.6.3 were considerably more accurate than the cumulant-based approximation in (5.35). Especially in the case of supergaussian densities, the cumulant-based approximation performed very poorly; this is probably because it gives too much weight to the tails of the distribution.
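The sign change of the kurtosis at the gaussian member of the family can be checked numerically. The sketch below assumes the density is proportional to $\exp(-|\xi|^\alpha)$ (the exact form of (5.50) is not reproduced here, so this is an assumption), and uses the standard moment formula $E|x|^k \propto \Gamma((k+1)/\alpha)/\Gamma(1/\alpha)$. Kurtosis is invariant to scaling, so the unit-variance normalization does not matter. The function name `generalized_gaussian_kurtosis` is hypothetical.

```python
import math


def generalized_gaussian_kurtosis(alpha: float) -> float:
    """Excess kurtosis of a density proportional to exp(-|x|**alpha).

    Assumed form of the family; the normalized fourth moment is
    Gamma(5/alpha) * Gamma(1/alpha) / Gamma(3/alpha)**2.
    """
    g1 = math.gamma(1.0 / alpha)
    g3 = math.gamma(3.0 / alpha)
    g5 = math.gamma(5.0 / alpha)
    return g5 * g1 / (g3 * g3) - 3.0


# Exponents below 2 give positive kurtosis, 2 gives zero, above 2 negative.
for alpha in (0.5, 1.0, 2.0, 3.0):
    print(alpha, generalized_gaussian_kurtosis(alpha))
```

For an exponent of 1 the density is Laplacian, whose excess kurtosis is 3, which gives a quick sanity check of the formula.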


Fig. 5.3 Comparison of different approximations of negentropy for the family of densities (5.50) parametrized by $\alpha$ (horizontal axis). On the left, approximations for densities of positive kurtosis ($0.5 \le \alpha < 2$) are depicted, and on the right, approximations for densities of negative kurtosis ($2 < \alpha \le 3$). Solid curve: true negentropy. Dotted curve: cumulant-based approximation as in (5.35). Dashed curve: approximation $J_a$ in (5.47). Dot-dashed curve: approximation $J_b$ in (5.48). Clearly, the maximum entropy approximations were much better than the cumulant-based one, especially in the case of densities of positive kurtosis.

Problems

5.1 Assume that the random variable $X$ can have two values, $a$ and $b$, as in Example 5.1. Compute the entropy as a function of the probability of obtaining $a$. Show that this is maximized when the probability is $1/2$.

5.2 Compute the entropy of $X$ in Example 5.3.

5.3 Assume $x$ has a Laplacian distribution of arbitrary variance with pdf

$$p_x(\xi) = \frac{1}{\sqrt{2}\lambda} \exp\Big(-\frac{\sqrt{2}}{\lambda}|\xi|\Big) \tag{5.51}$$

Compute the differential entropy.

5.4 Prove (5.15).

5.5 Prove (5.25).

5.6 Show that the definition of mutual information using Kullback-Leibler divergence is equal to the one given by entropy.

5.7 Compute the three first Chebyshev-Hermite polynomials.

5.8 Prove (5.34). Use the orthogonality in (5.29), and in particular the fact that $H_3$ and $H_4$ are orthogonal to any second-order polynomial (prove this first!). Furthermore, use the fact that any expression involving a third-order monomial of the higher-order cumulants is infinitely smaller than terms involving only second-order monomials (due to the assumption that the pdf is very close to gaussian).
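As a numerical sanity check for Problem 5.1 (not a substitute for the proof), the entropy of a two-valued variable can be evaluated on a grid of probabilities. This is a minimal sketch; the function name `binary_entropy` and the grid resolution are illustrative choices, and the entropy is measured in nats.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy (in nats) of a variable taking two values with probabilities p and 1 - p."""
    if p in (0.0, 1.0):
        return 0.0  # the limit of -p*log(p) as p -> 0 is zero
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


# Evaluate on a uniform grid and locate the maximizing probability.
grid = [k / 1000 for k in range(1001)]
best_p = max(grid, key=binary_entropy)
print(best_p, binary_entropy(best_p))
```

The grid search only suggests where the maximum lies; the exercise still asks for an analytical argument, for example by differentiating the entropy with respect to the probability.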

Computer assignments

(A.1)

(A.3)

where the Jacobians cancel each other, and we have

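$$H(\mathbf{y}) = -\int p_{\mathbf{x}}(\boldsymbol{\xi}) \log[p_{\mathbf{x}}(\boldsymbol{\xi})]\, d\boldsymbol{\xi} + \int p_{\mathbf{x}}(\boldsymbol{\xi}) \log|\det J\mathbf{f}(\boldsymbol{\xi})|\, d\boldsymbol{\xi} \tag{A.4}$$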

which gives (5.13).
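The entropy transformation rule can be checked in a case where everything is available in closed form. The sketch below assumes a linear map $\mathbf{y} = \mathbf{M}\mathbf{x}$ of a two-dimensional gaussian vector, so that the Jacobian is the constant matrix $\mathbf{M}$ and the rule reduces to $H(\mathbf{y}) = H(\mathbf{x}) + \log|\det \mathbf{M}|$. It uses the standard gaussian entropy formula $\frac{1}{2}\log((2\pi e)^n \det \mathbf{C})$; the matrices and helper names are illustrative.

```python
import math


def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def matmul2(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def transpose2(m):
    return [[m[0][0], m[1][0]], [m[0][1], m[1][1]]]


def gaussian_entropy(cov):
    """Differential entropy (nats) of a 2-D gaussian with covariance cov."""
    return 0.5 * math.log((2.0 * math.pi * math.e) ** 2 * det2(cov))


cov_x = [[2.0, 0.3], [0.3, 1.0]]   # illustrative covariance of x
M = [[1.5, -0.5], [0.2, 2.0]]      # illustrative invertible linear map

# Covariance of y = M x is M C_x M^T.
cov_y = matmul2(matmul2(M, cov_x), transpose2(M))

lhs = gaussian_entropy(cov_y)
rhs = gaussian_entropy(cov_x) + math.log(abs(det2(M)))
print(lhs, rhs)
```

For a nonlinear transformation the second term becomes an expectation of the log-Jacobian over $\mathbf{x}$, which this constant-Jacobian example does not exercise.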

Now follow the proofs connected with the entropy approximations. First, we prove (5.41).

Due to the assumption of near-gaussianity, we can write $p_0(\xi)$ as

$$p_0(\xi) = A \exp\Big(-\xi^2/2 + a_{n+1}\xi + (a_{n+2} + 1/2)\xi^2 + \sum_{i=1}^{n} a_i G^i(\xi)\Big), \tag{A.5}$$


where in the exponential, all other terms are very small with respect to the first one. Thus, using the first-order approximation $\exp(\epsilon) \approx 1 + \epsilon$, we obtain

$$p_0(\xi) \approx \tilde{A}\varphi(\xi)\Big(1 + a_{n+1}\xi + (a_{n+2} + 1/2)\xi^2 + \sum_{i=1}^{n} a_i G^i(\xi)\Big), \tag{A.6}$$

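where $\varphi(\xi) = (2\pi)^{-1/2}\exp(-\xi^2/2)$ is the standardized gaussian density and $\tilde{A} = \sqrt{2\pi}A$. Using the orthogonality properties of the $G^i$, the equations for solving $\tilde{A}$ and the $a_i$ become linear and almost diagonal:

$$\int p_0(\xi)\, d\xi = \tilde{A}(1 + (a_{n+2} + 1/2)) = 1 \tag{A.7}$$

$$\int p_0(\xi)\,\xi\, d\xi = \tilde{A} a_{n+1} = 0 \tag{A.8}$$

$$\int p_0(\xi)\,\xi^2\, d\xi = \tilde{A}(1 + 3(a_{n+2} + 1/2)) = 1 \tag{A.9}$$

$$\int p_0(\xi) G^i(\xi)\, d\xi = \tilde{A} a_i = c_i, \quad \text{for } i = 1, \ldots, n \tag{A.10}$$

and can be easily solved to yield $\tilde{A} = 1$, $a_{n+1} = 0$, $a_{n+2} = -1/2$ and $a_i = c_i$, $i = 1, \ldots, n$. This gives (5.41).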

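Second, we prove (5.42). Using the Taylor expansion $(1+\epsilon)\log(1+\epsilon) = \epsilon + \epsilon^2/2 + o(\epsilon^2)$, one obtains

$$-\int \hat{p}(\xi) \log \hat{p}(\xi)\, d\xi \tag{A.11}$$

$$= -\int \varphi(\xi)\Big(1 + \sum c_i G^i(\xi)\Big)\Big(\log\Big(1 + \sum c_i G^i(\xi)\Big) + \log\varphi(\xi)\Big)\, d\xi \tag{A.12}$$

$$= -\int \varphi(\xi)\log\varphi(\xi)\, d\xi - \int \varphi(\xi) \sum c_i G^i(\xi) \log\varphi(\xi)\, d\xi \tag{A.13}$$

$$\quad - \int \varphi(\xi)\Big[\sum c_i G^i(\xi) + \frac{1}{2}\Big(\sum c_i G^i(\xi)\Big)^2 + o\Big(\Big(\sum c_i G^i(\xi)\Big)^2\Big)\Big]\, d\xi \tag{A.14}$$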

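$$= H(\nu) - \frac{1}{2}\sum c_i^2 + o\Big(\sum c_i^2\Big) \tag{A.15}$$

where $H(\nu)$ is the entropy of a standardized gaussian variable $\nu$. The second integral in (A.13) and the first-order term in (A.14) vanish because the $G^i$ are orthogonal to all polynomials of up to second order, and the quadratic term reduces to $\sum c_i^2$ by orthonormality.

Finally, the two functions $G^1$ and $G^2$ must be orthonormalized. It is enough to determine constants $\alpha_1, \delta_1, \alpha_2, \gamma_2, \delta_2$ so that $F^1(x) = (G^1(x) + \alpha_1 x)/\delta_1$ and $F^2(x) = (G^2(x) + \alpha_2 x^2 + \gamma_2)/\delta_2$ are orthogonal to any second-order polynomial and have unit norm in the metric defined by $\varphi$ (taking $G^1$ to be odd and $G^2$ to be even, as the equations below presuppose). First we solve

$$\int \varphi(\xi)\,\xi\,(G^1(\xi) + \alpha_1 \xi)\, d\xi = 0 \tag{A.16}$$

$$\int \varphi(\xi)\,\xi^k\,(G^2(\xi) + \alpha_2 \xi^2 + \gamma_2)\, d\xi = 0, \quad \text{for } k = 0, 2 \tag{A.17}$$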

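A straightforward solution gives:

$$\alpha_1 = -\int \varphi(\xi) G^1(\xi)\,\xi\, d\xi \tag{A.18}$$

$$\alpha_2 = \frac{1}{2}\Big(\int \varphi(\xi) G^2(\xi)\, d\xi - \int \varphi(\xi) G^2(\xi)\,\xi^2\, d\xi\Big) \tag{A.19}$$

$$\gamma_2 = \frac{1}{2}\Big(\int \varphi(\xi) G^2(\xi)\,\xi^2\, d\xi - 3\int \varphi(\xi) G^2(\xi)\, d\xi\Big) \tag{A.20}$$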

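Next note that the standardization of $x$ (zero mean and unit variance), together with $\int \varphi(\xi)(G^2(\xi) + \alpha_2 \xi^2 + \gamma_2)\, d\xi = 0$, implies

$$c_i = E\{F^i(x)\} = [E\{G^i(x)\} - E\{G^i(\nu)\}]/\delta_i \tag{A.21}$$

where $\nu$ is a standardized gaussian variable.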

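This implies (5.43), with $k_i^2 = 1/(2\delta_i^2)$. Thus we only need to determine explicitly the $\delta_i$ for each function. We solve the two equations

$$\int \varphi(\xi)(G^1(\xi) + \alpha_1 \xi)^2/\delta_1^2\, d\xi = 1 \tag{A.22}$$

$$\int \varphi(\xi)(G^2(\xi) + \alpha_2 \xi^2 + \gamma_2)^2/\delta_2^2\, d\xi = 1 \tag{A.23}$$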

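which, after some tedious manipulations, yield:

$$\delta_1^2 = \int \varphi(\xi) G^1(\xi)^2\, d\xi - \Big(\int \varphi(\xi) G^1(\xi)\,\xi\, d\xi\Big)^2 \tag{A.24}$$

$$\delta_2^2 = \int \varphi(\xi) G^2(\xi)^2\, d\xi - \Big(\int \varphi(\xi) G^2(\xi)\, d\xi\Big)^2 - \frac{1}{2}\Big(\int \varphi(\xi) G^2(\xi)\, d\xi - \int \varphi(\xi) G^2(\xi)\,\xi^2\, d\xi\Big)^2. \tag{A.25}$$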

Evaluating the $\delta_i$ for the given functions $G^i$, one obtains (5.47) and (5.48) by the relation $k_i^2 = 1/(2\delta_i^2)$.
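To make the last step concrete, the first normalization constant, written here as $\delta_1^2 = \int \varphi(\xi) G^1(\xi)^2 d\xi - (\int \varphi(\xi) G^1(\xi)\xi\, d\xi)^2$ with $\varphi$ the standardized gaussian density, can be evaluated numerically. The sketch below assumes the odd function $G^1(x) = x\exp(-x^2/2)$ purely as an illustrative choice, and compares the result with the value $(8\sqrt{3} - 9)/72$ obtained by evaluating the two gaussian integrals by hand. The helper names and the integration range are hypothetical.

```python
import math


def phi(x):
    """Standardized gaussian density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def g1(x):
    """Assumed odd nonquadratic function, chosen for illustration."""
    return x * math.exp(-0.5 * x * x)


def integrate(f, lo=-10.0, hi=10.0, n=20000):
    """Composite Simpson rule on [lo, hi] with n (even) subintervals."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(lo + k * h)
    return total * h / 3.0


# The two integrals entering the expression for delta_1 squared.
second_moment = integrate(lambda x: phi(x) * g1(x) ** 2)
cross_term = integrate(lambda x: phi(x) * g1(x) * x)

delta1_sq = second_moment - cross_term ** 2
k1_sq = 1.0 / (2.0 * delta1_sq)          # relation k_i^2 = 1 / (2 delta_i^2)
closed_form = (8.0 * math.sqrt(3.0) - 9.0) / 72.0
print(delta1_sq, closed_form, k1_sq)
```

The same numerical recipe applies to an even function $G^2$, using the corresponding three-term expression for the second constant.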

6

Principal Component Analysis and Whitening

Principal component analysis (PCA) and the closely related Karhunen-Loève transform, or the Hotelling transform, are classic techniques in statistical data analysis, feature extraction, and data compression, stemming from the early work of Pearson [364]. Given a set of multivariate measurements, the purpose is to find a smaller set of variables with less redundancy, that would give as good a representation as possible.

This goal is related to the goal of independent component analysis (ICA). However, in PCA the redundancy is measured by correlations between data elements, while in ICA the much richer concept of independence is used, and in ICA the reduction of the number of variables is given less emphasis. Using only the correlations as in PCA has the advantage that the analysis can be based on second-order statistics only.

In connection with ICA, PCA is a useful preprocessing step.

The basic PCA problem is outlined in this chapter. Both the closed-form solution and on-line learning algorithms for PCA are reviewed. Next, the related linear statistical technique of factor analysis is discussed. The chapter is concluded by presenting how data can be preprocessed by whitening, removing the effect of first- and second-order statistics, which is very helpful as the first step in ICA.
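As a preview of what whitening does, the following minimal sketch centers a two-dimensional sample (removing first-order statistics) and then rotates and rescales it along the eigenvectors of its covariance matrix so that the result has identity covariance (removing second-order statistics). It is an illustration under simplifying assumptions, not the chapter's algorithm: the data are synthetic, the 2x2 eigendecomposition is done with a closed-form rotation angle, and the names `covariance_2d` and `whiten_2d` are hypothetical.

```python
import math
import random


def covariance_2d(points):
    """Return (mean, (a, b, c)) where [[a, b], [b, c]] is the sample covariance."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    a = sum((p[0] - mx) ** 2 for p in points) / n
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    c = sum((p[1] - my) ** 2 for p in points) / n
    return (mx, my), (a, b, c)


def whiten_2d(points):
    """Center the data, rotate onto principal axes, scale to unit variance."""
    (mx, my), (a, b, c) = covariance_2d(points)
    # Rotation angle that diagonalizes the symmetric 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    cs, sn = math.cos(theta), math.sin(theta)
    # Variances along the two principal axes (the eigenvalues).
    lam1 = a * cs * cs + 2.0 * b * cs * sn + c * sn * sn
    lam2 = a * sn * sn - 2.0 * b * cs * sn + c * cs * cs
    whitened = []
    for x, y in points:
        u, v = x - mx, y - my
        whitened.append(((cs * u + sn * v) / math.sqrt(lam1),
                         (-sn * u + cs * v) / math.sqrt(lam2)))
    return whitened


# Synthetic correlated data with nonzero mean (illustrative only).
rng = random.Random(0)
data = []
for _ in range(500):
    s1, s2 = rng.gauss(0.0, 1.0), rng.uniform(-1.0, 1.0)
    data.append((2.0 * s1 + s2 + 3.0, 0.5 * s1 - s2 - 1.0))

z = whiten_2d(data)
mean_z, cov_z = covariance_2d(z)
print(mean_z, cov_z)
```

The whitened sample has zero mean, unit variances, and zero correlation, but its components are in general still statistically dependent, which is where ICA takes over.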
