
From the document Independent Component Analysis (pages 160-167)

Recently, Miao and Hua [300] proposed such a criterion: again denoting by $\mathbf{W} = (\mathbf{w}_1 \ldots \mathbf{w}_m)^T$ the matrix whose rows are the weight vectors, this “novel information criterion” (NIC) is


are generally not assumed to be equal or infinitely small, as in the special case of principal FA. We can write the covariance matrix of the observations from (6.30) as

$$E\{\mathbf{x}\mathbf{x}^T\} = \mathbf{C}_x = \mathbf{A}\mathbf{A}^T + \mathbf{Q} \tag{6.32}$$

In practice, we have a good estimate of $\mathbf{C}_x$ available, given by the sample covariance matrix. The main problem is then to solve the matrix $\mathbf{A}$ of factor loadings and the diagonal noise covariance matrix $\mathbf{Q}$ such that they will explain the observed covariances from (6.32). There is no closed-form analytic solution for $\mathbf{A}$ and $\mathbf{Q}$.

Assuming $\mathbf{Q}$ is known or can be estimated, we can attempt to solve $\mathbf{A}$ from $\mathbf{A}\mathbf{A}^T = \mathbf{C}_x - \mathbf{Q}$. The number of factors is usually constrained to be much smaller than the number of dimensions in the data, so this equation cannot be exactly solved; something similar to a least-squares solution should be used instead. Clearly, this problem does not have a unique solution: any orthogonal transform or rotation $\mathbf{A} \rightarrow \mathbf{A}\mathbf{T}$, with $\mathbf{T}$ an orthogonal matrix (for which $\mathbf{T}\mathbf{T}^T = \mathbf{I}$), will produce exactly the same left-hand side. We need some extra constraints to make the solution unique.
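The rotational indeterminacy can be checked numerically. The following is a minimal sketch, assuming a hypothetical random loading matrix with 5 observed variables and 2 factors and an arbitrary rotation angle; none of these values come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical loading matrix: 5 observed variables, 2 factors.
A = rng.standard_normal((5, 2))

# An orthogonal matrix T: a plane rotation by an arbitrary angle.
theta = 0.7
T = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

A_rot = A @ T

# (AT)(AT)^T = A T T^T A^T = A A^T, so both loading matrices
# explain exactly the same covariances.
same_product = np.allclose(A @ A.T, A_rot @ A_rot.T)
print(same_product)
```

The two loading matrices differ element by element, yet their products with their own transposes agree, which is why second-order information alone cannot fix the rotation.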

Now, looking for a factor-based interpretation of the observed variables, FA typically tries to solve the matrix $\mathbf{A}$ in such a way that the variables would have high loadings on a small number of factors, and very low loadings on the remaining factors. The results are then easier to interpret. This principle has been used in such techniques as varimax, quartimax, and oblimin rotations. Several classic techniques for such factor rotations are covered by Harman [166].

There are some important differences between PCA, FA, and ICA. Principal component analysis is not based on a generative model, although it can be derived from one. It is a linear transformation that is based either on variance maximization or minimum mean-square error representation. The PCA model is invertible in the (theoretical) case of no compression, i.e., when all the principal components are retained. Once the principal components $y_i$ have been found, the original observations can be readily expressed as their linear functions as $\mathbf{x} = \sum_{i=1}^{n} y_i \mathbf{w}_i$, and also the principal components are simply obtained as linear functions of the observations: $y_i = \mathbf{w}_i^T \mathbf{x}$.
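The invertibility of PCA without compression can be illustrated with a short sketch. It assumes synthetic three-dimensional data and uses the eigenvectors of the sample covariance matrix as the vectors $\mathbf{w}_i$; the data and variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic zero-mean data; each row of X is one observation x.
X = rng.standard_normal((1000, 3)) @ rng.standard_normal((3, 3))
X -= X.mean(axis=0)

# Sample covariance matrix and its orthonormal eigenvectors w_i (columns of W).
C = X.T @ X / X.shape[0]
d, W = np.linalg.eigh(C)

# Principal components: y_i = w_i^T x for every observation.
Y = X @ W

# Reconstruction with all components retained: x = sum_i y_i w_i.
X_back = Y @ W.T

print(np.allclose(X, X_back))
```

If only some columns of `W` were kept, the reconstruction would be a least-squares approximation instead of an exact inverse.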

The FA model is a generative latent variable model; the observations are expressed in terms of the factors, but the values of the factors cannot be directly computed from the observations. This is due to the additive term of specific factors or noise which is considered important in some application fields. Further, the rows of matrix $\mathbf{A}$ are generally not (proportional to) eigenvectors of $\mathbf{C}_x$; several different estimation methods exist.

FA, as well as PCA, is a purely second-order statistical method: only covariances between the observed variables are used in the estimation, which is due to the assumption of gaussianity of the factors. The factors are further assumed to be uncorrelated, which also implies independence in the case of gaussian data. ICA is a similar generative latent variable model, but now the factors or independent components are assumed to be statistically independent and nongaussian — a much stronger assumption that removes the rotational redundancy of the FA model. In fact, ICA can be considered as one particular method of determining the factor rotation.

The noise term is usually omitted in the ICA model; see Chapter 15 for a detailed discussion on this point.

6.4 WHITENING

As already discussed in Chapter 1, the ICA problem is greatly simplified if the observed mixture vectors are first whitened or sphered. A zero-mean random vector $\mathbf{z} = (z_1 \ldots z_n)^T$ is said to be white if its elements $z_i$ are uncorrelated and have unit variances:

$$E\{z_i z_j\} = \delta_{ij}$$

In terms of the covariance matrix, this obviously means that $E\{\mathbf{z}\mathbf{z}^T\} = \mathbf{I}$, with $\mathbf{I}$ the unit matrix. The best-known example is white noise; then the elements $z_i$ would be the intensities of noise at consecutive time points $i = 1, 2, \ldots$ and there are no temporal correlations in the noise process. The term “white” comes from the fact that the power spectrum of white noise is constant over all frequencies, somewhat like the spectrum of white light contains all colors.

A synonym for white is sphered. If the density of the vector $\mathbf{z}$ is radially symmetric and suitably scaled, then it is sphered. An example is the multivariate gaussian density that has zero mean and unit covariance matrix. The opposite does not hold: the density of a sphered vector does not have to be radially symmetric. An example is a two-dimensional uniform density that has the shape of a rotated square; see Fig. 7.10. It is easy to see that in this case both the variables $z_1$ and $z_2$ on the coordinate axes have unit variance (if the side of the square has length $2\sqrt{3}$) and they are uncorrelated, independently of the rotation angle. Thus vector $\mathbf{z}$ is sphered, even if the density is highly nonsymmetric. Note that the densities of the elements $z_i$ of a sphered random vector need not be the same.
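The rotated-square example can be checked by simulation. This is a rough sketch, assuming a sample of 200,000 points and an arbitrary rotation angle of 30 degrees; the sample covariance only approximates the unit matrix up to sampling error.

```python
import numpy as np

rng = np.random.default_rng(2)

# Uniform density on an axis-aligned square with side length 2*sqrt(3),
# centered at the origin, so each coordinate has variance (2*sqrt(3))^2 / 12 = 1.
half_side = np.sqrt(3.0)
S = rng.uniform(-half_side, half_side, size=(200_000, 2))

# Rotate the square by an arbitrary angle (30 degrees is an assumption).
angle = np.deg2rad(30.0)
R = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])
Z = S @ R.T

# Sample estimate of E{z z^T}; it should be close to the unit matrix.
cov_z = Z.T @ Z / Z.shape[0]
print(np.round(cov_z, 2))
```

Changing `angle` leaves the estimate near the unit matrix, although the density itself is clearly not radially symmetric.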

Because whitening is essentially decorrelation followed by scaling, the technique of PCA can be used. This implies that whitening can be done with a linear operation.

The problem of whitening is now: Given a random vector $\mathbf{x}$ with $n$ elements, find a linear transformation $\mathbf{V}$ into another vector $\mathbf{z}$ such that $\mathbf{z} = \mathbf{V}\mathbf{x}$ is white (sphered).

The problem has a straightforward solution in terms of the PCA expansion. Let $\mathbf{E} = (\mathbf{e}_1 \ldots \mathbf{e}_n)$ be the matrix whose columns are the unit-norm eigenvectors of the covariance matrix $\mathbf{C}_x = E\{\mathbf{x}\mathbf{x}^T\}$. These can be computed from a sample of the vectors $\mathbf{x}$ either directly or by one of the on-line PCA learning rules. Let $\mathbf{D} = \mathrm{diag}(d_1 \ldots d_n)$ be the diagonal matrix of the eigenvalues of $\mathbf{C}_x$. Then a linear whitening transform is given by

$$\mathbf{V} = \mathbf{D}^{-1/2}\mathbf{E}^T \tag{6.33}$$

This matrix always exists when the eigenvalues $d_i$ are positive; in practice, this is not a restriction. Remember (see Chapter 4) that $\mathbf{C}_x$ is positive semidefinite, in practice positive definite for almost any natural data, so its eigenvalues will be positive.


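It is easy to show that the matrix $\mathbf{V}$ of Eq. (6.33) is indeed a whitening transformation. Recalling that $\mathbf{C}_x$ can be written in terms of its eigenvector and eigenvalue matrices $\mathbf{E}$ and $\mathbf{D}$ as $\mathbf{C}_x = \mathbf{E}\mathbf{D}\mathbf{E}^T$, with $\mathbf{E}$ an orthogonal matrix satisfying $\mathbf{E}^T\mathbf{E} = \mathbf{E}\mathbf{E}^T = \mathbf{I}$, it holds:

$$E\{\mathbf{z}\mathbf{z}^T\} = \mathbf{V}E\{\mathbf{x}\mathbf{x}^T\}\mathbf{V}^T = \mathbf{D}^{-1/2}\mathbf{E}^T\mathbf{E}\mathbf{D}\mathbf{E}^T\mathbf{E}\mathbf{D}^{-1/2} = \mathbf{I}$$

The covariance of $\mathbf{z}$ is the unit matrix, hence $\mathbf{z}$ is white.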
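As an illustration of Eq. (6.33), the following minimal sketch whitens synthetic data through the eigendecomposition of the sample covariance matrix. The mixing matrix and the sample size are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic correlated data: rows of X are observations x (assumed setup).
M = rng.standard_normal((3, 3))          # hypothetical mixing matrix
X = rng.standard_normal((5000, 3)) @ M.T
X -= X.mean(axis=0)                      # whitening assumes zero mean

# Sample covariance C_x and its eigendecomposition C_x = E D E^T.
C_x = X.T @ X / X.shape[0]
d, E = np.linalg.eigh(C_x)               # d: eigenvalues, columns of E: eigenvectors

# Whitening matrix of Eq. (6.33): V = D^{-1/2} E^T.
V = np.diag(d ** -0.5) @ E.T

# Whitened data z = V x and its covariance.
Z = X @ V.T
cov_z = Z.T @ Z / Z.shape[0]
print(np.round(cov_z, 6))
```

Because `V` is built from the same sample covariance that is used to check the result, `cov_z` equals the unit matrix up to rounding error; on new data from the same distribution it would only be approximately the unit matrix.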

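The linear operator $\mathbf{V}$ of (6.33) is by no means the only whitening matrix. It is easy to see that any matrix $\mathbf{U}\mathbf{V}$, with $\mathbf{U}$ an orthogonal matrix, is also a whitening matrix. This is because for $\mathbf{z} = \mathbf{U}\mathbf{V}\mathbf{x}$ it holds:

$$E\{\mathbf{z}\mathbf{z}^T\} = \mathbf{U}\mathbf{V}E\{\mathbf{x}\mathbf{x}^T\}\mathbf{V}^T\mathbf{U}^T = \mathbf{U}\mathbf{I}\mathbf{U}^T = \mathbf{I}$$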

An important instance is the matrix $\mathbf{E}\mathbf{D}^{-1/2}\mathbf{E}^T$. This is a whitening matrix because it is obtained by multiplying $\mathbf{V}$ of Eq. (6.33) from the left by the orthogonal matrix $\mathbf{E}$. This matrix is called the inverse square root of $\mathbf{C}_x$, and denoted by $\mathbf{C}_x^{-1/2}$, because it comes from the standard extension of square roots to matrices.
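The non-uniqueness of the whitening matrix can be verified on a small example. The sketch below assumes a hypothetical $2 \times 2$ covariance matrix and draws an arbitrary orthogonal matrix $\mathbf{U}$ from a QR decomposition; both are illustrative choices.

```python
import numpy as np

# Hypothetical covariance matrix (symmetric, positive definite).
C_x = np.array([[3.0, 1.0],
                [1.0, 2.0]])

d, E = np.linalg.eigh(C_x)

# Whitening matrix of Eq. (6.33) and the symmetric inverse square root.
V = np.diag(d ** -0.5) @ E.T
V_sym = E @ V                      # E D^{-1/2} E^T = C_x^{-1/2}

# An arbitrary orthogonal matrix U, taken from a QR decomposition.
U, _ = np.linalg.qr(np.random.default_rng(4).standard_normal((2, 2)))
V_other = U @ V

# All three matrices whiten: W C_x W^T = I.
for W in (V, V_sym, V_other):
    print(np.allclose(W @ C_x @ W.T, np.eye(2)))
```

Among these, only `V_sym` is symmetric, and multiplying it by itself gives the inverse of `C_x`, which is what the name inverse square root refers to.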

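It is also possible to perform whitening by on-line learning rules, similar to the PCA learning rules reviewed earlier. One such direct rule is

$$\Delta\mathbf{V} = \gamma(\mathbf{I} - \mathbf{V}\mathbf{x}\mathbf{x}^T\mathbf{V}^T)\mathbf{V} = \gamma(\mathbf{I} - \mathbf{z}\mathbf{z}^T)\mathbf{V} \tag{6.34}$$

where $\gamma$ is the learning rate. It can be seen that at a stationary point, when the change in the value of $\mathbf{V}$ is zero on the average, it holds

$$(\mathbf{I} - E\{\mathbf{z}\mathbf{z}^T\})\mathbf{V} = 0$$

for which a whitened $\mathbf{z} = \mathbf{V}\mathbf{x}$ is a solution. It can be shown (see, e.g., [71]) that the algorithm will indeed converge to a whitening transformation $\mathbf{V}$.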
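The behavior of the rule near its stationary point can be sketched numerically. To keep the example deterministic, the sketch below does not use single samples as the on-line rule does; instead it iterates the averaged update, with $\mathbf{x}\mathbf{x}^T$ replaced by a known covariance matrix. The covariance matrix, the step size 0.1, the starting point, and the number of iterations are all assumptions.

```python
import numpy as np

# Hypothetical covariance matrix of x (assumed known for this sketch).
C_x = np.array([[3.0, 1.0],
                [1.0, 2.0]])

gamma = 0.1          # assumed learning rate
V = np.eye(2)        # assumed starting point
I = np.eye(2)

# Averaged version of the update: V <- V + gamma * (I - V C_x V^T) V.
# The on-line rule would use z z^T for one sample in place of V C_x V^T.
for _ in range(500):
    V = V + gamma * (I - V @ C_x @ V.T) @ V

cov_z = V @ C_x @ V.T
print(np.round(cov_z, 6))
```

With these settings the covariance of $\mathbf{z} = \mathbf{V}\mathbf{x}$ approaches the unit matrix; a larger step size may make the iteration unstable, and the stochastic rule would fluctuate around the solution instead of settling exactly.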

6.5 ORTHOGONALIZATION

In some PCA and ICA algorithms, we know that in theory the solution vectors (PCA basis vectors or ICA basis vectors) are orthogonal or orthonormal, but the iterative algorithms do not always automatically produce orthogonality. Then it may be necessary to orthogonalize the vectors after each iteration step, or at some suitable intervals. In this subsection, we look into some basic orthogonalization methods.

Simply stated, the problem is as follows: given a set of $n$-dimensional linearly independent vectors $\mathbf{a}_1, \ldots, \mathbf{a}_m$, with $m \leq n$, compute another set of $m$ vectors $\mathbf{w}_1, \ldots, \mathbf{w}_m$ that are orthogonal or orthonormal (i.e., orthogonal and having unit Euclidean norm) and that span the same subspace as the original vectors. This means that each $\mathbf{w}_i$ is some linear combination of the $\mathbf{a}_j$.

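The classic approach is the Gram-Schmidt orthogonalization (GSO) method [284]:

$$\mathbf{w}_1 = \mathbf{a}_1 \tag{6.35}$$

$$\mathbf{w}_j = \mathbf{a}_j - \sum_{i=1}^{j-1} \frac{\mathbf{w}_i^T\mathbf{a}_j}{\mathbf{w}_i^T\mathbf{w}_i}\,\mathbf{w}_i \tag{6.36}$$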

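As a result, $\mathbf{w}_i^T\mathbf{w}_j = 0$ for $i \neq j$, as is easy to show by induction. Assume that the first $j-1$ basis vectors are already orthogonal; from (6.36) it then follows for any $k < j$ that

$$\mathbf{w}_k^T\mathbf{w}_j = \mathbf{w}_k^T\mathbf{a}_j - \sum_{i=1}^{j-1} \frac{\mathbf{w}_i^T\mathbf{a}_j}{\mathbf{w}_i^T\mathbf{w}_i}\,(\mathbf{w}_k^T\mathbf{w}_i).$$

In the sum, all the inner products $\mathbf{w}_k^T\mathbf{w}_i$ are zero except the one where $i = k$. This term becomes equal to $\frac{\mathbf{w}_k^T\mathbf{a}_j}{\mathbf{w}_k^T\mathbf{w}_k}(\mathbf{w}_k^T\mathbf{w}_k) = \mathbf{w}_k^T\mathbf{a}_j$, and thus the inner product $\mathbf{w}_k^T\mathbf{w}_j$ is zero, too.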

If in the GSO each $\mathbf{w}_j$ is further divided by its norm, the set will be orthonormal.
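The GSO recursion (6.35)-(6.36) translates almost line by line into code. The following is a minimal sketch; the function name `gram_schmidt`, the optional normalization flag, and the example vectors are illustrative assumptions.

```python
import numpy as np

def gram_schmidt(A, normalize=False):
    """Orthogonalize the columns a_j of A following (6.35)-(6.36)."""
    A = np.asarray(A, dtype=float)
    W = np.zeros_like(A)
    for j in range(A.shape[1]):
        w = A[:, j].copy()                 # start from a_j (w_1 = a_1 when j = 0)
        for i in range(j):
            w_i = W[:, i]
            # Subtract the component of a_j along the earlier vector w_i.
            w -= (w_i @ A[:, j]) / (w_i @ w_i) * w_i
        W[:, j] = w
    if normalize:
        W /= np.linalg.norm(W, axis=0)     # divide each w_j by its norm
    return W

# Three linearly independent vectors in four dimensions (m = 3, n = 4).
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 1.0]])

W = gram_schmidt(A, normalize=True)
print(np.round(W.T @ W, 6))
```

The result depends on the order of the columns of `A`: the first vector is kept up to scaling and each later one is corrected against its predecessors, which is the sequential character discussed next.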

The GSO is a sequential orthogonalization procedure. It is the basis of deflation approaches to PCA and ICA. A problem with sequential orthogonalization is the cumulation of errors.

In symmetric orthonormalization methods, none of the original vectors $\mathbf{a}_i$ is treated differently from the others. If it is sufficient to find any orthonormal basis for the subspace spanned by the original vectors, without other constraints on the new vectors, then this problem does not have a unique solution. This can be accomplished for instance by first forming the matrix $\mathbf{A} = (\mathbf{a}_1 \ldots \mathbf{a}_m)$ whose columns are the vectors to be orthogonalized, then computing $(\mathbf{A}^T\mathbf{A})^{-1/2}$ using the eigendecomposition of the symmetric matrix $(\mathbf{A}^T\mathbf{A})$, and finally putting

$$\mathbf{W} = \mathbf{A}(\mathbf{A}^T\mathbf{A})^{-1/2} \tag{6.37}$$

Obviously, for matrix $\mathbf{W}$ it holds $\mathbf{W}^T\mathbf{W} = \mathbf{I}$, and its columns $\mathbf{w}_1, \ldots, \mathbf{w}_m$ span the same subspace as the columns of matrix $\mathbf{A}$. These vectors are thus a suitable orthonormalized basis. This solution to the symmetric orthonormalization problem is by no means unique; again, any matrix $\mathbf{W}\mathbf{U}$ with $\mathbf{U}$ an orthogonal matrix will do quite as well.

However, among these solutions, there is one specific orthogonal matrix that is closest to matrix $\mathbf{A}$ (in an appropriate matrix norm). Then this matrix is the orthogonal projection of $\mathbf{A}$ onto the set of orthogonal matrices [284]. This is somewhat analogous to the normalization of one vector $\mathbf{a}$; the vector $\mathbf{a}/\|\mathbf{a}\|$ is the projection of $\mathbf{a}$ onto the set of unit-norm vectors (the unit sphere). For matrices, it can be shown that the matrix $\mathbf{A}(\mathbf{A}^T\mathbf{A})^{-1/2}$ in Eq. (6.37) is in fact the unique orthogonal projection of $\mathbf{A}$ onto this set.
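Eq. (6.37) and the closeness property can be illustrated together. The sketch below assumes a random $4 \times 3$ matrix, measures distance in the Frobenius norm (one possible choice of "appropriate matrix norm"), and compares against a single arbitrary alternative $\mathbf{W}\mathbf{U}$; the helper name `symmetric_orthonormalize` is hypothetical.

```python
import numpy as np

def symmetric_orthonormalize(A):
    """Return W = A (A^T A)^{-1/2}, using the eigendecomposition of A^T A."""
    d, E = np.linalg.eigh(A.T @ A)           # A^T A = E diag(d) E^T
    inv_sqrt = E @ np.diag(d ** -0.5) @ E.T  # (A^T A)^{-1/2}
    return A @ inv_sqrt

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 3))              # columns: vectors to orthonormalize

W = symmetric_orthonormalize(A)

# Another orthonormal basis of the same subspace: W U with U orthogonal.
U, _ = np.linalg.qr(rng.standard_normal((3, 3)))

dist_W = np.linalg.norm(A - W)               # Frobenius distance to A
dist_WU = np.linalg.norm(A - W @ U)

print(np.allclose(W.T @ W, np.eye(3)), dist_W <= dist_WU)
```

Comparing against one random `U` is of course only a spot check of the projection property, not a proof.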

This orthogonalization should be preferred in gradient algorithms that minimize a function $J(\mathbf{W})$ under the constraint $\mathbf{W}^T\mathbf{W} = \mathbf{I}$. As explained in Chapter 3, one iteration step consists of two parts: first, the matrix $\mathbf{W}$ is updated by the usual gradient descent, and second, the updated matrix is projected orthogonally onto the constraint set. For this second stage, the form given in (6.37) for orthogonalizing the updated matrix should be used.

There are iterative methods for symmetric orthonormalization that avoid the matrix eigendecomposition and inversion. An example is the following iterative algorithm [197], starting from a nonorthogonal matrix $\mathbf{W}(0)$:

$$\mathbf{W}(1) = \mathbf{W}(0)/\|\mathbf{W}(0)\|, \tag{6.38}$$

$$\mathbf{W}(t+1) = \tfrac{3}{2}\mathbf{W}(t) - \tfrac{1}{2}\mathbf{W}(t)\mathbf{W}(t)^T\mathbf{W}(t) \tag{6.39}$$


The iteration is continued until $\mathbf{W}(t)^T\mathbf{W}(t) \approx \mathbf{I}$. The convergence of this iteration can be proven as follows [197]: matrices $\mathbf{W}(t)^T\mathbf{W}(t)$ and $\mathbf{W}(t+1)^T\mathbf{W}(t+1) = \tfrac{9}{4}\mathbf{W}(t)^T\mathbf{W}(t) - \tfrac{3}{2}[\mathbf{W}(t)^T\mathbf{W}(t)]^2 + \tfrac{1}{4}[\mathbf{W}(t)^T\mathbf{W}(t)]^3$ have clearly the same eigenvectors, and the relation between the eigenvalues is

$$d(t+1) = \tfrac{9}{4}d(t) - \tfrac{3}{2}d^2(t) + \tfrac{1}{4}d^3(t) \tag{6.40}$$

This nonlinear scalar iteration will converge on the interval $[0, 1]$ to 1 (see exercises).
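The iteration (6.38)-(6.39) is short enough to try directly. The sketch below is one possible reading, assuming the spectral norm (largest singular value) for the initial normalization and a hypothetical stopping tolerance and iteration cap; the function name is illustrative.

```python
import numpy as np

def iterative_orthonormalize(W0, tol=1e-12, max_iter=100):
    """Symmetric orthonormalization by the iteration (6.38)-(6.39)."""
    # Step (6.38): normalize so that the eigenvalues of W^T W lie in (0, 1].
    # The spectral norm is an assumption made for this sketch.
    W = W0 / np.linalg.norm(W0, 2)
    I = np.eye(W.shape[1])
    for _ in range(max_iter):
        if np.allclose(W.T @ W, I, atol=tol):
            break
        # Step (6.39): W <- (3/2) W - (1/2) W W^T W.
        W = 1.5 * W - 0.5 * W @ W.T @ W
    return W

rng = np.random.default_rng(6)
A = rng.standard_normal((4, 3))

W_iter = iterative_orthonormalize(A)

# Reference: the closed form A (A^T A)^{-1/2} of Eq. (6.37).
d, E = np.linalg.eigh(A.T @ A)
W_ref = A @ E @ np.diag(d ** -0.5) @ E.T

print(np.allclose(W_iter.T @ W_iter, np.eye(3)), np.allclose(W_iter, W_ref))
```

In this example the iterate agrees with the closed-form solution, which is consistent with the eigenvector argument: the iteration only moves the eigenvalues of $\mathbf{W}^T\mathbf{W}$ toward 1 and leaves its eigenvectors unchanged.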

Due to the original normalization, all the eigenvalues are on this interval, assuming that the norm in the normalization is appropriately chosen (it must be a proper norm in the space of matrices; most conventional norms, except for the Frobenius norm, have this property). Because the eigenvalues tend to 1, the matrix itself tends to the unit matrix.

6.6 CONCLUDING REMARKS AND REFERENCES

Good general discussions on PCA are [14, 109, 324, 112]. The variance maximization criterion of PCA covered in Section 6.1.1 is due to Hotelling [185], while in the original work by Pearson [364], the starting point was minimizing the squared reconstruction error (Section 6.1.2). These are not the only criteria leading to the PCA solution; yet another information-theoretic approach is maximization of mutual information between the inputs and outputs in a linear gaussian channel [112]. An expansion closely related to PCA is the Karhunen-Loève expansion for continuous second-order stochastic processes, whose autocovariance function can be expanded in terms of its eigenvalues and orthonormal eigenfunctions in a convergent series [237, 283].

The on-line algorithms of Section 6.2 are especially suitable for neural network implementations. In numerical analysis and signal processing, many other adaptive algorithms of varying complexity have been reported for different computing hardware. A good review is given by Comon and Golub [92]. Experimental results on PCA algorithms both for finding the eigenvectors of stationary training sets, and for tracking the slowly changing eigenvectors of nonstationary input data streams, have been reported in [324, 391, 350]. An obvious extension of PCA neural networks would be to use nonlinear units, e.g., perceptrons, instead of the linear units. It turns out that such “nonlinear PCA” networks will in some cases give the independent components of the input vectors, instead of just uncorrelated components [232, 233] (see Chapter 12).

Good general texts on factor analysis are [166, 243, 454]. The principal FA model has been recently discussed by [421] and [387].

Problems

6.1 Consider the problem of maximizing the variance of $y_m = \mathbf{w}_m^T\mathbf{x}$ $(m = 1, \ldots, n)$ under the constraint that $\mathbf{w}_m$ must be of unit Euclidean norm and orthogonal to all the previously-found principal vectors $\mathbf{w}_i$, $i < m$. Show that the solution is given by $\mathbf{w}_m = \mathbf{e}_m$ with $\mathbf{e}_m$ the eigenvector of $\mathbf{C}_x$ corresponding to the $m$th largest eigenvalue.

6.2 Show that the criterion (6.9) is equivalent to the mean-square error (6.7). Show that at the optimum, if $\mathbf{w}_i = \mathbf{e}_i$, the value of (6.7) is given by (6.10).

6.3 Given the data model (6.14), show that the covariance matrix has the form (6.15).

6.4 The learning rule for a PCA neuron is based on maximization of $y^2 = (\mathbf{w}^T\mathbf{x})^2$ under constraint $\|\mathbf{w}\| = 1$. (We have now omitted the subscript 1 because only one neuron is involved.)

6.4.1. Show that an unlimited gradient ascent method would compute the new vector $\mathbf{w}$ from

$$\mathbf{w} \leftarrow \mathbf{w} + \gamma(\mathbf{w}^T\mathbf{x})\mathbf{x}$$

with $\gamma$ the learning rate. Show that the norm of the weight vector always grows in this case.

6.4.2. Thus the norm must be bounded. A possibility is the following update rule:

$$\mathbf{w} \leftarrow [\mathbf{w} + \gamma(\mathbf{w}^T\mathbf{x})\mathbf{x}] / \|\mathbf{w} + \gamma(\mathbf{w}^T\mathbf{x})\mathbf{x}\|$$

Now the norm will stay equal to 1. Derive an approximation to this update rule for a small value of $\gamma$, by taking a Taylor expansion of the right-hand side with respect to $\gamma$ and dropping all higher powers of $\gamma$. Leave only terms linear in $\gamma$. Show that the result is

$$\mathbf{w} \leftarrow \mathbf{w} + \gamma[(\mathbf{w}^T\mathbf{x})\mathbf{x} - (\mathbf{w}^T\mathbf{x})^2\mathbf{w}]$$

which is the basic PCA learning rule of Eq. (6.16).

6.4.3. Take averages with respect to the random input vector $\mathbf{x}$ and show that in a stationary point of the iteration, where there is no change on the average in the value of $\mathbf{w}$, it holds: $\mathbf{C}_x\mathbf{w} = (\mathbf{w}^T\mathbf{C}_x\mathbf{w})\mathbf{w}$ with $\mathbf{C}_x = E\{\mathbf{x}\mathbf{x}^T\}$.

6.4.4. Show that the only possible solutions will be the eigenvectors of $\mathbf{C}_x$.

6.5 The covariance matrix of vector $\mathbf{x}$ is

$$\mathbf{C}_x = \begin{pmatrix} 2.5 & 1.5 \\ 1.5 & 2.5 \end{pmatrix} \tag{6.41}$$

Compute a whitening transformation for $\mathbf{x}$.

6.6 * Based on the first step (6.38) of the orthogonalization algorithm, show that $0 < d(1) < 1$ where $d(1)$ is any eigenvalue of $\mathbf{W}(1)$. Next consider iteration (6.40). Write $d(t+1) - 1$ in terms of $d(t) - 1$. Show that $d(t)$ converges to 1. How fast is the convergence?

