
CONCLUDING REMARKS AND REFERENCES

From the document Independent Component Analysis (pages 142-147)

MATHEMATICAL PRELIMINARIES

5.7 CONCLUDING REMARKS AND REFERENCES

Fig. 5.2 Comparison of different approximations of negentropy for the family of mixture densities in (5.49), parametrized by $\mu$ ranging from 0 to 1 (horizontal axis). Solid curve: true negentropy. Dotted curve: cumulant-based approximation as in (5.35). Dashed curve: approximation $J_a$ in (5.47). Dot-dashed curve: approximation $J_b$ in (5.48). The two maximum entropy approximations were clearly better than the cumulant-based one.

where $\alpha$ is a positive constant, and $C_1, C_2$ are normalization constants that make $p_\alpha$ a probability density of unit variance. For different values of $\alpha$, the densities in this family exhibit different shapes. For $\alpha < 2$, one obtains densities of positive kurtosis (supergaussian). For $\alpha = 2$, one obtains the gaussian density, and for $\alpha > 2$, a density of negative kurtosis. Thus the densities in this family can be used as examples of different symmetric nongaussian densities. In Fig. 5.3, the different negentropy approximations are plotted for this family, using parameter values $0.5 \le \alpha \le 3$. Since the densities used are all symmetric, the first terms in the approximations were neglected. Again, it is clear that both of the approximations $J_a$ and $J_b$ introduced in Section 5.6.3 were considerably more accurate than the cumulant-based approximation in (5.35). Especially in the case of supergaussian densities, the cumulant-based approximation performed very poorly; this is probably because it gives too much weight to the tails of the distribution.
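The dependence of the sign of the kurtosis on $\alpha$ can be checked numerically. The following is a minimal sketch, assuming that the family (5.50), which is not reproduced here, has the form $p_\alpha(\xi) = C_1 \exp(C_2 |\xi|^\alpha)$ with $C_2 < 0$; the function name is illustrative. It relies on the closed-form absolute moments of such densities, for which the kurtosis $E\{\xi^4\}/E\{\xi^2\}^2 - 3$ equals $\Gamma(5/\alpha)\Gamma(1/\alpha)/\Gamma(3/\alpha)^2 - 3$ regardless of the scale.

```python
import math


def generalized_gaussian_kurtosis(alpha: float) -> float:
    """Kurtosis (fourth cumulant of the standardized variable) of a density
    proportional to exp(-c * |xi|**alpha), c > 0 (assumed form of the family).

    The value is scale invariant, so it also holds for the unit-variance member.
    """
    g1 = math.gamma(1.0 / alpha)
    g3 = math.gamma(3.0 / alpha)
    g5 = math.gamma(5.0 / alpha)
    return g5 * g1 / (g3 * g3) - 3.0


for alpha in (0.5, 1.0, 2.0, 3.0):
    k = generalized_gaussian_kurtosis(alpha)
    print(f"alpha = {alpha:3.1f}   kurtosis = {k:+.4f}")
```

Under this assumption the kurtosis is positive for $\alpha < 2$, vanishes at $\alpha = 2$, and is negative for $\alpha > 2$, in line with the description above.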


Fig. 5.3 Comparison of different approximations of negentropy for the family of densities (5.50), parametrized by $\alpha$ (horizontal axis). On the left, approximations for densities of positive kurtosis ($0.5 \le \alpha < 2$) are depicted, and on the right, approximations for densities of negative kurtosis ($2 < \alpha \le 3$). Solid curve: true negentropy. Dotted curve: cumulant-based approximation as in (5.35). Dashed curve: approximation $J_a$ in (5.47). Dot-dashed curve: approximation $J_b$ in (5.48). Clearly, the maximum entropy approximations were much better than the cumulant-based one, especially in the case of densities of positive kurtosis.

Problems

5.1 Assume that the random variable $X$ can have two values, $a$ and $b$, as in Example 5.1. Compute the entropy as a function of the probability of obtaining $a$. Show that this is maximized when the probability is $1/2$.

5.2 Compute the entropy of $X$ in Example 5.3.

5.3 Assume $x$ has a Laplacian distribution of arbitrary variance with pdf

$$p_x(\xi) = \frac{1}{\sqrt{2}\,\lambda} \exp\Big(-\frac{\sqrt{2}}{\lambda}\,|\xi|\Big) \tag{5.51}$$

Compute the differential entropy.

5.4 Prove (5.15).

5.5 Prove (5.25).

5.6 Show that the definition of mutual information using Kullback-Leibler divergence is equal to the one given by entropy.

5.7 Compute the first three Chebyshev-Hermite polynomials.

5.8 Prove (5.34). Use the orthogonality in (5.29), and in particular the fact that $H_3$ and $H_4$ are orthogonal to any second-order polynomial (prove this first!). Furthermore, use the fact that any expression involving a third-order monomial of the higher-order cumulants is infinitely smaller than terms involving only second-order monomials (due to the assumption that the pdf is very close to gaussian).

Computer assignments

5.1 Consider random variables with (1) a uniform distribution and (2) a Laplacian distribution, both with zero mean and unit variance. Compute their differential entropies with numerical integration. Then compute the polynomial and nonpolynomial approximations given in this chapter, and compare the results.
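One possible starting point for this assignment is sketched below in Python, using only the standard library. It is an illustrative sketch, not a reference solution: the helper names, the midpoint integration rule, and the integration ranges are choices made here, and the cumulant-based approximation is assumed to have the form $J(x) \approx \frac{1}{12}E\{x^3\}^2 + \frac{1}{48}\mathrm{kurt}(x)^2$, since equation (5.35) is not reproduced in this section. The nonpolynomial approximations (5.47) and (5.48) are left out because their constants are not given here.

```python
import math


def integrate(f, a, b, n=100_000):
    """Composite midpoint rule for the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


def differential_entropy(pdf, a, b):
    """Numerically evaluate -int p log p over [a, b], treating 0 log 0 as 0."""
    def integrand(x):
        p = pdf(x)
        return -p * math.log(p) if p > 0.0 else 0.0
    return integrate(integrand, a, b)


def moment(pdf, k, a, b):
    """k-th moment of a zero-mean density by numerical integration."""
    return integrate(lambda x: x**k * pdf(x), a, b)


HALF_WIDTH = math.sqrt(3.0)  # uniform on [-sqrt(3), sqrt(3)] has unit variance


def uniform_pdf(x):
    return 1.0 / (2.0 * HALF_WIDTH) if abs(x) <= HALF_WIDTH else 0.0


def laplacian_pdf(x):
    # Unit-variance Laplacian density.
    return math.exp(-math.sqrt(2.0) * abs(x)) / math.sqrt(2.0)


# Entropy of the unit-variance gaussian, needed for the negentropy.
GAUSS_ENTROPY = 0.5 * math.log(2.0 * math.pi * math.e)


def report(pdf, a, b):
    """Return (entropy, true negentropy, assumed cumulant-based approximation)."""
    h = differential_entropy(pdf, a, b)
    third = moment(pdf, 3, a, b)
    kurt = moment(pdf, 4, a, b) - 3.0  # unit variance assumed
    j_true = GAUSS_ENTROPY - h
    j_cumulant = third**2 / 12.0 + kurt**2 / 48.0
    return h, j_true, j_cumulant


results = {
    "uniform": report(uniform_pdf, -HALF_WIDTH, HALF_WIDTH),
    "laplacian": report(laplacian_pdf, -30.0, 30.0),
}

for name, (h, j_true, j_cumulant) in results.items():
    print(f"{name:10s} H = {h:.4f}  J = {j_true:.4f}  J_cumulant = {j_cumulant:.4f}")
```

The numerical entropies can be compared against the closed forms $\log(2\sqrt{3})$ for the uniform density and $1 + \frac{1}{2}\log 2$ for the unit-variance Laplacian density.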

Appendix proofs

First, we give a detailed proof of (5.13). We have by (5.10)

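$$
\begin{aligned}
H(\mathbf{y}) &= -\int p_y(\boldsymbol{\eta}) \log p_y(\boldsymbol{\eta})\, d\boldsymbol{\eta} \\
&= -\int p_x(\mathbf{f}^{-1}(\boldsymbol{\eta}))\, |\det J\mathbf{f}(\mathbf{f}^{-1}(\boldsymbol{\eta}))|^{-1} \log\big[ p_x(\mathbf{f}^{-1}(\boldsymbol{\eta}))\, |\det J\mathbf{f}(\mathbf{f}^{-1}(\boldsymbol{\eta}))|^{-1} \big]\, d\boldsymbol{\eta} \\
&= -\int p_x(\mathbf{f}^{-1}(\boldsymbol{\eta})) \log\big[ p_x(\mathbf{f}^{-1}(\boldsymbol{\eta})) \big]\, |\det J\mathbf{f}(\mathbf{f}^{-1}(\boldsymbol{\eta}))|^{-1}\, d\boldsymbol{\eta} \\
&\quad - \int p_x(\mathbf{f}^{-1}(\boldsymbol{\eta})) \log\big[ |\det J\mathbf{f}(\mathbf{f}^{-1}(\boldsymbol{\eta}))|^{-1} \big]\, |\det J\mathbf{f}(\mathbf{f}^{-1}(\boldsymbol{\eta}))|^{-1}\, d\boldsymbol{\eta}
\end{aligned}
\tag{A.1}
$$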

Now, let us make the change of integration variable

$$\boldsymbol{\xi} = \mathbf{f}^{-1}(\boldsymbol{\eta}) \tag{A.2}$$

which gives us

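$$
\begin{aligned}
H(\mathbf{y}) &= -\int p_x(\boldsymbol{\xi}) \log\big[ p_x(\boldsymbol{\xi}) \big]\, |\det J\mathbf{f}(\boldsymbol{\xi})|^{-1}\, |\det J\mathbf{f}(\boldsymbol{\xi})|\, d\boldsymbol{\xi} \\
&\quad - \int p_x(\boldsymbol{\xi}) \log\big[ |\det J\mathbf{f}(\boldsymbol{\xi})|^{-1} \big]\, |\det J\mathbf{f}(\boldsymbol{\xi})|^{-1}\, |\det J\mathbf{f}(\boldsymbol{\xi})|\, d\boldsymbol{\xi}
\end{aligned}
\tag{A.3}
$$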

where the Jacobians cancel each other, and we have

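$$H(\mathbf{y}) = -\int p_x(\boldsymbol{\xi}) \log\big[ p_x(\boldsymbol{\xi}) \big]\, d\boldsymbol{\xi} + \int p_x(\boldsymbol{\xi}) \log |\det J\mathbf{f}(\boldsymbol{\xi})|\, d\boldsymbol{\xi} \tag{A.4}$$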

which gives (5.13).
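The result of this derivation, that the entropy of the transformed variable equals the entropy of the original variable plus the expected log-Jacobian, can be checked numerically in one dimension. The sketch below is only an illustration under assumptions chosen here: $x$ is a standardized gaussian, the invertible transformation is $f(x) = \tanh(x)$, and the integrals are evaluated with a simple midpoint rule; none of these choices come from the text.

```python
import math


def integrate(f, a, b, n=200_000):
    """Composite midpoint rule for the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


def phi(x):
    """Standardized gaussian density, used here as p_x."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def p_y(eta):
    """Density of y = tanh(x): p_x(f^{-1}(eta)) * |f'(f^{-1}(eta))|^{-1}.

    Since f'(x) = 1 - tanh(x)^2, the Jacobian factor is 1 / (1 - eta^2).
    """
    return phi(math.atanh(eta)) / (1.0 - eta * eta)


def neg_p_log_p(p):
    return -p * math.log(p) if p > 0.0 else 0.0


# Left-hand side: entropy of y computed directly from its density on (-1, 1).
h_y_direct = integrate(lambda eta: neg_p_log_p(p_y(eta)), -1.0, 1.0)

# Right-hand side: H(x) + E{log |f'(x)|}, with log |f'(x)| = -2 log cosh(x).
h_x = integrate(lambda xi: neg_p_log_p(phi(xi)), -10.0, 10.0)
mean_log_jacobian = integrate(
    lambda xi: phi(xi) * (-2.0 * math.log(math.cosh(xi))), -10.0, 10.0
)
h_y_formula = h_x + mean_log_jacobian

print(f"direct  H(y) = {h_y_direct:.5f}")
print(f"formula H(y) = {h_y_formula:.5f}")
```

The two printed values should agree up to the quadrature error. Writing the log-Jacobian as $-2\log\cosh(x)$ avoids taking the logarithm of $1 - \tanh^2(x)$, which underflows for large $|x|$.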

Now follow the proofs connected with the entropy approximations. First, we prove (5.41).

Due to the assumption of near-gaussianity, we can write $p_0(\xi)$ as

$$p_0(\xi) = A \exp\Big( -\xi^2/2 + a_{n+1}\xi + (a_{n+2} + 1/2)\,\xi^2 + \sum_{i=1}^{n} a_i G_i(\xi) \Big), \tag{A.5}$$


where in the exponential, all other terms are very small with respect to the first one. Thus, using the first-order approximation $\exp(\epsilon) \approx 1 + \epsilon$, we obtain

$$p_0(\xi) \approx \tilde{A}\,\varphi(\xi) \Big( 1 + a_{n+1}\xi + (a_{n+2} + 1/2)\,\xi^2 + \sum_{i=1}^{n} a_i G_i(\xi) \Big), \tag{A.6}$$

where $\varphi(\xi) = (2\pi)^{-1/2} \exp(-\xi^2/2)$ is the standardized gaussian density, and $\tilde{A} = \sqrt{2\pi}\,A$.

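Due to the orthogonality constraints in (5.39), the equations for solving $\tilde{A}$ and $a_i$ become linear and almost diagonal:

$$\int p_0(\xi)\, d\xi = \tilde{A}\,\big(1 + (a_{n+2} + 1/2)\big) = 1 \tag{A.7}$$

$$\int p_0(\xi)\,\xi\, d\xi = \tilde{A}\, a_{n+1} = 0 \tag{A.8}$$

$$\int p_0(\xi)\,\xi^2\, d\xi = \tilde{A}\,\big(1 + 3(a_{n+2} + 1/2)\big) = 1 \tag{A.9}$$

$$\int p_0(\xi)\, G_i(\xi)\, d\xi = \tilde{A}\, a_i = c_i, \quad \text{for } i = 1, \ldots, n \tag{A.10}$$

and can be easily solved to yield $\tilde{A} = 1$, $a_{n+1} = 0$, $a_{n+2} = -1/2$ and $a_i = c_i$, $i = 1, \ldots, n$.

This gives (5.41).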
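For readers who want the intermediate step spelled out, the solution of this linear system can be written as follows. This is a worked sketch added for illustration, using the constants of (A.7)-(A.10), with $\tilde{A}$ denoting the rescaled normalization constant and $\varphi$ the standardized gaussian density. Subtracting (A.7) from (A.9) eliminates the constant term:

$$\tilde{A}\,\big(1 + 3(a_{n+2} + 1/2)\big) - \tilde{A}\,\big(1 + (a_{n+2} + 1/2)\big) = 2\tilde{A}\,(a_{n+2} + 1/2) = 0 .$$

Since $\tilde{A}$ normalizes a density, it cannot be zero, so $a_{n+2} = -1/2$. Substituting this into (A.7) gives $\tilde{A} = 1$, after which (A.8) gives $a_{n+1} = 0$ and (A.10) gives $a_i = c_i$. With these values the first-order approximation reduces to $\varphi(\xi)\big(1 + \sum_{i=1}^{n} c_i G_i(\xi)\big)$, which is the form of the approximating density used in the proof of (5.42).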

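Second, we prove (5.42). Using the Taylor expansion $(1 + \epsilon)\log(1 + \epsilon) = \epsilon + \epsilon^2/2 + o(\epsilon^2)$, one obtains

$$-\int \hat{p}(\xi) \log \hat{p}(\xi)\, d\xi \tag{A.11}$$

$$= -\int \varphi(\xi) \Big(1 + \sum_i c_i G_i(\xi)\Big) \Big( \log\Big(1 + \sum_i c_i G_i(\xi)\Big) + \log \varphi(\xi) \Big)\, d\xi \tag{A.12}$$

$$= -\int \varphi(\xi) \log \varphi(\xi)\, d\xi - \int \varphi(\xi) \sum_i c_i G_i(\xi) \log \varphi(\xi)\, d\xi \tag{A.13}$$

$$\quad - \int \varphi(\xi) \Big[ \sum_i c_i G_i(\xi) + \frac{1}{2} \Big(\sum_i c_i G_i(\xi)\Big)^2 + o\Big(\Big(\sum_i c_i G_i(\xi)\Big)^2\Big) \Big]\, d\xi \tag{A.14}$$

$$= H(\nu) - 0 - 0 - \frac{1}{2} \sum_i c_i^2 + o\Big(\Big(\sum_i c_i\Big)^2\Big) \tag{A.15}$$

due to the orthogonality relationships in (5.39).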
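The second-order Taylor expansion on which this proof rests can be checked numerically. The short sketch below is only an illustration (the function names are chosen here): it compares $(1 + \epsilon)\log(1 + \epsilon)$ with $\epsilon + \epsilon^2/2$ and prints the remainder divided by $\epsilon^2$, which should tend to zero as $\epsilon$ shrinks if the remainder is indeed $o(\epsilon^2)$.

```python
import math


def exact(eps):
    """(1 + eps) * log(1 + eps), using log1p for accuracy at small eps."""
    return (1.0 + eps) * math.log1p(eps)


def second_order(eps):
    """Second-order Taylor polynomial eps + eps^2 / 2."""
    return eps + 0.5 * eps * eps


def scaled_remainder(eps):
    """Remainder of the expansion divided by eps^2."""
    return (exact(eps) - second_order(eps)) / (eps * eps)


for eps in (0.1, 0.01, 0.001):
    print(f"eps = {eps:6.3f}   remainder / eps^2 = {scaled_remainder(eps):+.6f}")
```

The scaled remainder shrinks roughly in proportion to $\epsilon$, consistent with a leading remainder term of third order.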

Finally, we prove (5.43), (5.47) and (5.48). First, we must orthonormalize the two functions $G_1$ and $G_2$ according to (5.39). To do this, it is enough to determine constants $\alpha_1, \delta_1, \alpha_2, \gamma_2, \delta_2$ so that the functions $F_1(x) = (G_1(x) + \alpha_1 x)/\delta_1$ and $F_2(x) = (G_2(x) + \alpha_2 x^2 + \gamma_2)/\delta_2$ are orthogonal to any second degree polynomials as in (5.39), and have unit norm in the metric defined by $\varphi$. In fact, as will be seen below, this modification gives a $G_1$ that is odd and a $G_2$ that is even, and therefore the $G_i$ are automatically orthogonal with respect to each other.

Thus, first we solve the following equations:

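$$\int \varphi(\xi)\,\xi\,\big(G_1(\xi) + \alpha_1 \xi\big)\, d\xi = 0 \tag{A.16}$$

$$\int \varphi(\xi)\,\xi^k\,\big(G_2(\xi) + \alpha_2 \xi^2 + \gamma_2\big)\, d\xi = 0, \quad \text{for } k = 0, 2 \tag{A.17}$$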

A straightforward solution gives:

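$$\alpha_1 = -\int \varphi(\xi)\, G_1(\xi)\,\xi\, d\xi \tag{A.18}$$

$$\alpha_2 = \frac{1}{2} \Big( \int \varphi(\xi)\, G_2(\xi)\, d\xi - \int \varphi(\xi)\, G_2(\xi)\,\xi^2\, d\xi \Big) \tag{A.19}$$

$$\gamma_2 = \frac{1}{2} \Big( \int \varphi(\xi)\, G_2(\xi)\,\xi^2\, d\xi - 3 \int \varphi(\xi)\, G_2(\xi)\, d\xi \Big) \tag{A.20}$$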

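Next, note that the standardization of $x$ (zero mean and unit variance), together with $\int \varphi(\xi)\,\big(G_2(\xi) + \alpha_2 \xi^2 + \gamma_2\big)\, d\xi = 0$, implies

$$c_i = E\{F_i(x)\} = \big[ E\{G_i(x)\} - E\{G_i(\nu)\} \big] / \delta_i \tag{A.21}$$

where $\nu$ is a standardized gaussian variable.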

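This implies (5.43), with $k_i^2 = 1/(2\delta_i^2)$. Thus we only need to determine explicitly the $\delta_i$ for each function. We solve the two equations

$$\int \varphi(\xi)\,\big(G_1(\xi) + \alpha_1 \xi\big)^2 / \delta_1^2\; d\xi = 1 \tag{A.22}$$

$$\int \varphi(\xi)\,\big(G_2(\xi) + \alpha_2 \xi^2 + \gamma_2\big)^2 / \delta_2^2\; d\xi = 1 \tag{A.23}$$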

which, after some tedious manipulations, yield:

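$$\delta_1^2 = \int \varphi(\xi)\, G_1(\xi)^2\, d\xi - \Big( \int \varphi(\xi)\, G_1(\xi)\,\xi\, d\xi \Big)^2 \tag{A.24}$$

$$\delta_2^2 = \int \varphi(\xi)\, G_2(\xi)^2\, d\xi - \Big( \int \varphi(\xi)\, G_2(\xi)\, d\xi \Big)^2 - \frac{1}{2} \Big( \int \varphi(\xi)\, G_2(\xi)\, d\xi - \int \varphi(\xi)\, G_2(\xi)\,\xi^2\, d\xi \Big)^2 . \tag{A.25}$$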

Evaluating the $\delta_i$ for the given functions $G_i$, one obtains (5.47) and (5.48) by the relation $k_i^2 = 1/(2\delta_i^2)$.
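The normalization constants in (A.24) and (A.25) can also be evaluated numerically. The sketch below is an illustration under stated assumptions: the functions $G_1(\xi) = \xi \exp(-\xi^2/2)$ (odd) and $G_2(\xi) = |\xi|$ (even) are example choices made here, since the functions used in Section 5.6.3 are not reproduced in this section, and all helper names, the midpoint rule, and the integration range are likewise hypothetical. The code names `alpha2` and `gamma2` stand for the two constants of (A.19) and (A.20).

```python
import math


def integrate(f, a=-12.0, b=12.0, n=100_000):
    """Composite midpoint rule for the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


def phi(x):
    """Standardized gaussian density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def gauss_mean(g):
    """Integral of phi(xi) * g(xi), i.e. the mean of g under phi."""
    return integrate(lambda x: phi(x) * g(x))


def delta1_squared(G1):
    """Right-hand side of (A.24) for an odd function G1."""
    return gauss_mean(lambda x: G1(x) ** 2) - gauss_mean(lambda x: G1(x) * x) ** 2


def even_constants(G2):
    """Constants of (A.19), (A.20) and the right-hand side of (A.25)."""
    m0 = gauss_mean(G2)
    m2 = gauss_mean(lambda x: G2(x) * x * x)
    alpha2 = 0.5 * (m0 - m2)
    gamma2 = 0.5 * (m2 - 3.0 * m0)
    delta2_sq = gauss_mean(lambda x: G2(x) ** 2) - m0**2 - 0.5 * (m0 - m2) ** 2
    return alpha2, gamma2, delta2_sq


def G1(x):
    return x * math.exp(-0.5 * x * x)  # example odd function (assumption)


G2 = abs  # example even function (assumption)

d1_sq = delta1_squared(G1)
alpha2, gamma2, d2_sq = even_constants(G2)


def F2(x):
    """Orthonormalized version of G2, as defined before (A.16)."""
    return (G2(x) + alpha2 * x * x + gamma2) / math.sqrt(d2_sq)


norm_F2 = gauss_mean(lambda x: F2(x) ** 2)  # should be close to 1

print(f"delta_1^2 = {d1_sq:.6f}   1/(2 delta_1^2) = {1.0 / (2.0 * d1_sq):.4f}")
print(f"delta_2^2 = {d2_sq:.6f}   1/(2 delta_2^2) = {1.0 / (2.0 * d2_sq):.4f}")
print(f"norm of F2 under phi = {norm_F2:.6f}")
```

For these example functions the gaussian integrals have closed forms, $\delta_1^2 = 1/(3\sqrt{3}) - 1/8$ and $\delta_2^2 = 1 - 3/\pi$, which gives a way to validate the quadrature before trying other choices of $G_i$.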

6

Principal Component Analysis and Whitening

Principal component analysis (PCA) and the closely related Karhunen-Loève transform, or the Hotelling transform, are classic techniques in statistical data analysis, feature extraction, and data compression, stemming from the early work of Pearson [364]. Given a set of multivariate measurements, the purpose is to find a smaller set of variables with less redundancy, that would give as good a representation as possible.

This goal is related to the goal of independent component analysis (ICA). However, in PCA the redundancy is measured by correlations between data elements, while in ICA the much richer concept of independence is used, and in ICA the reduction of the number of variables is given less emphasis. Using only the correlations as in PCA has the advantage that the analysis can be based on second-order statistics only.

In connection with ICA, PCA is a useful preprocessing step.

The basic PCA problem is outlined in this chapter. Both the closed-form solution and on-line learning algorithms for PCA are reviewed. Next, the related linear statistical technique of factor analysis is discussed. The chapter is concluded by presenting how data can be preprocessed by whitening, removing the effect of first- and second-order statistics, which is very helpful as the first step in ICA.
