6 Principal Component Analysis and Whitening

Principal component analysis (PCA) and the closely related Karhunen-Loève transform, or the Hotelling transform, are classic techniques in statistical data analysis, feature extraction, and data compression, stemming from the early work of Pearson [364]. Given a set of multivariate measurements, the purpose is to find a smaller set of variables with less redundancy that would give as good a representation as possible.

This goal is related to the goal of independent component analysis (ICA). However, in PCA the redundancy is measured by correlations between data elements, while in ICA the much richer concept of independence is used, and in ICA the reduction of the number of variables is given less emphasis. Using only the correlations as in PCA has the advantage that the analysis can be based on second-order statistics only.

In connection with ICA, PCA is a useful preprocessing step.

The basic PCA problem is outlined in this chapter. Both the closed-form solution and on-line learning algorithms for PCA are reviewed. Next, the related linear statistical technique of factor analysis is discussed. The chapter is concluded by presenting how data can be preprocessed by whitening, removing the effect of first- and second-order statistics, which is very helpful as the first step in ICA.

6.1 PRINCIPAL COMPONENTS

[...] model is assumed for vector $\mathbf{x}$. Typically the elements of $\mathbf{x}$ are measurements like pixel gray levels or values of a signal at different time instants. It is essential in PCA that the elements are mutually correlated, and there is thus some redundancy in $\mathbf{x}$, making compression possible. If the elements are independent, nothing can be achieved by PCA.

In the PCA transform, the vector $\mathbf{x}$ is first centered by subtracting its mean:

$$\mathbf{x} \leftarrow \mathbf{x} - E\{\mathbf{x}\}$$

The mean is in practice estimated from the available sample $\mathbf{x}(1), \ldots, \mathbf{x}(T)$ (see Chapter 4). Let us assume in the following that the centering has been done and thus $E\{\mathbf{x}\} = \mathbf{0}$. Next, $\mathbf{x}$ is linearly transformed to another vector $\mathbf{y}$ with $m$ elements, $m < n$, where $n$ is the number of elements of $\mathbf{x}$, so that the redundancy induced by the correlations is removed. This is done by finding a rotated orthogonal coordinate system such that the elements of $\mathbf{x}$ in the new coordinates become uncorrelated. At the same time, the variances of the projections of $\mathbf{x}$ on the new coordinate axes are maximized so that the first axis corresponds to the maximal variance, the second axis corresponds to the maximal variance in the direction orthogonal to the first axis, and so on.
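The centering and rotation steps described above can be sketched numerically. This is a minimal illustration on synthetic data, assuming NumPy; the mixing matrix `A`, the sizes `T`, `n`, `m`, and the use of the sample covariance with $1/T$ normalization are assumptions made for the example, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic correlated data: T samples of an n-dimensional vector x (rows are samples).
T, n, m = 5000, 4, 2
A = rng.normal(size=(n, n))             # arbitrary mixing matrix (illustrative only)
X = rng.normal(size=(T, n)) @ A.T + 3.0

# Step 1: center by subtracting the sample mean.
X_centered = X - X.mean(axis=0)

# Step 2: rotate to the eigenvector basis of the sample covariance matrix.
C = X_centered.T @ X_centered / T
d, E = np.linalg.eigh(C)                # eigh returns eigenvalues in ascending order
order = np.argsort(d)[::-1]             # reorder so that d[0] >= d[1] >= ...
d, E = d[order], E[:, order]

# Step 3: keep the m new coordinates with the largest variances.
Y = X_centered @ E[:, :m]

# The new coordinates are uncorrelated: their covariance matrix is diagonal.
C_y = Y.T @ Y / T
print(np.round(C_y, 6))
```

The off-diagonal entries of `C_y` vanish up to rounding, and the diagonal holds the variances along the first `m` axes in decreasing order.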

For instance, if $\mathbf{x}$ has a gaussian density that is constant over ellipsoidal surfaces in the $n$-dimensional space, then the rotated coordinate system coincides with the principal axes of the ellipsoid. A two-dimensional example is shown in Fig. 2.7 in Chapter 2. The principal components are now the projections of the data points on the two principal axes, $\mathbf{e}_1$ and $\mathbf{e}_2$. In addition to achieving uncorrelated components, the variances of the components (projections) also will be very different in most applications, with a considerable number of the variances so small that the corresponding components can be discarded altogether. Those components that are left constitute the vector $\mathbf{y}$.

As an example, take a set of $8 \times 8$ pixel windows from a digital image, an application that is considered in detail in Chapter 21. They are first transformed, e.g., using row-by-row scanning, into vectors $\mathbf{x}$ whose elements are the gray levels of the 64 pixels in the window. In real-time digital video transmission, it is essential to reduce this data as much as possible without losing too much of the visual quality, because the total amount of data is very large. Using PCA, a compressed representation vector $\mathbf{y}$ can be obtained from $\mathbf{x}$, which can be stored or transmitted. Typically, $\mathbf{y}$ can have as few as 10 elements, and a good replica of the original $8 \times 8$ image window can still be reconstructed from it. This kind of compression is possible because neighboring elements of $\mathbf{x}$, which are the gray levels of neighboring pixels in the digital image, are heavily correlated. These correlations are utilized by PCA, allowing almost the same information to be represented by a much smaller vector $\mathbf{y}$. PCA is a linear technique, so computing $\mathbf{y}$ from $\mathbf{x}$ is not heavy, which makes real-time processing possible.


6.1.1 PCA by variance maximization

In mathematical terms, consider a linear combination

$$y_1 = \sum_{k=1}^{n} w_{k1} x_k = \mathbf{w}_1^T \mathbf{x}$$

of the elements $x_1, \ldots, x_n$ of the vector $\mathbf{x}$. The $w_{11}, \ldots, w_{n1}$ are scalar coefficients or weights, elements of an $n$-dimensional vector $\mathbf{w}_1$, and $\mathbf{w}_1^T$ denotes the transpose of $\mathbf{w}_1$.

The factor $y_1$ is called the first principal component of $\mathbf{x}$ if the variance of $y_1$ is maximally large. Because the variance depends on both the norm and orientation of the weight vector $\mathbf{w}_1$ and grows without limits as the norm grows, we impose the constraint that the norm of $\mathbf{w}_1$ is constant, in practice equal to 1. Thus we look for a weight vector $\mathbf{w}_1$ maximizing the PCA criterion

$$J_1^{PCA}(\mathbf{w}_1) = E\{y_1^2\} = E\{(\mathbf{w}_1^T \mathbf{x})^2\} = \mathbf{w}_1^T E\{\mathbf{x}\mathbf{x}^T\} \mathbf{w}_1 = \mathbf{w}_1^T \mathbf{C}_x \mathbf{w}_1 \tag{6.1}$$

so that

$$\|\mathbf{w}_1\| = 1 \tag{6.2}$$

Here $E\{\cdot\}$ is the expectation over the (unknown) density of input vector $\mathbf{x}$, and the norm of $\mathbf{w}_1$ is the usual Euclidean norm defined as

$$\|\mathbf{w}_1\| = (\mathbf{w}_1^T \mathbf{w}_1)^{1/2} = \Big[\sum_{k=1}^{n} w_{k1}^2\Big]^{1/2}$$

The matrix $\mathbf{C}_x$ in Eq. (6.1) is the $n \times n$ covariance matrix of $\mathbf{x}$ (see Chapter 4), given for the zero-mean vector $\mathbf{x}$ by the correlation matrix

$$\mathbf{C}_x = E\{\mathbf{x}\mathbf{x}^T\} \tag{6.3}$$

It is well known from basic linear algebra (see, e.g., [324, 112]) that the solution to the PCA problem is given in terms of the unit-length eigenvectors $\mathbf{e}_1, \ldots, \mathbf{e}_n$ of the matrix $\mathbf{C}_x$. The ordering of the eigenvectors is such that the corresponding eigenvalues $d_1, \ldots, d_n$ satisfy $d_1 \geq d_2 \geq \cdots \geq d_n$. The solution maximizing (6.1) is given by

$$\mathbf{w}_1 = \mathbf{e}_1$$

Thus the first principal component of $\mathbf{x}$ is $y_1 = \mathbf{e}_1^T \mathbf{x}$.
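The claim that the leading eigenvector maximizes criterion (6.1) under constraint (6.2) can be checked numerically. The sketch below is only an illustration under assumptions: the data are synthetic, `J_pca` is a hypothetical helper name, and the comparison against random unit vectors is a sanity check, not a proof.

```python
import numpy as np

rng = np.random.default_rng(1)

# Zero-mean synthetic data with correlated elements (rows are samples).
T, n = 2000, 5
X = rng.normal(size=(T, n)) @ rng.normal(size=(n, n))
X -= X.mean(axis=0)
C = X.T @ X / T                          # sample estimate of C_x = E{x x^T}

def J_pca(w):
    """PCA criterion (6.1): variance of y_1 = w^T x for a weight vector w."""
    return w @ C @ w

d, E = np.linalg.eigh(C)                 # eigenvalues in ascending order
e1, d1 = E[:, -1], d[-1]                 # eigenvector of the largest eigenvalue

# Evaluate the criterion for many random unit-norm weight vectors, constraint (6.2).
W = rng.normal(size=(10000, n))
W /= np.linalg.norm(W, axis=1, keepdims=True)
J_random = np.einsum("ij,jk,ik->i", W, C, W)

print("J(e1) =", J_pca(e1), " largest eigenvalue =", d1)
print("best random J =", J_random.max())
```

No random unit vector exceeds `J_pca(e1)`, and that value coincides with the largest eigenvalue of the sample covariance matrix.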

The criterion $J_1^{PCA}$ in Eq. (6.1) can be generalized to $m$ principal components, with $m$ any number between 1 and $n$. Denoting the $m$-th ($1 \leq m \leq n$) principal component by $y_m = \mathbf{w}_m^T \mathbf{x}$, with $\mathbf{w}_m$ the corresponding unit norm weight vector, the variance of $y_m$ is now maximized under the constraint that $y_m$ is uncorrelated with all the previously found principal components:

$$E\{y_m y_k\} = 0, \quad k < m. \tag{6.4}$$

Note that the principal components $y_m$ have zero means because $E\{y_m\} = \mathbf{w}_m^T E\{\mathbf{x}\} = 0$.

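The condition (6.4) yields:

$$E\{y_m y_k\} = E\{(\mathbf{w}_m^T \mathbf{x})(\mathbf{w}_k^T \mathbf{x})\} = \mathbf{w}_m^T \mathbf{C}_x \mathbf{w}_k = 0 \tag{6.5}$$

For the second principal component, we have the condition that

$$\mathbf{w}_2^T \mathbf{C}_x \mathbf{w}_1 = d_1 \mathbf{w}_2^T \mathbf{e}_1 = 0 \tag{6.6}$$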

because we already know that $\mathbf{w}_1 = \mathbf{e}_1$. We are thus looking for maximal variance $E\{y_2^2\} = E\{(\mathbf{w}_2^T \mathbf{x})^2\}$ in the subspace orthogonal to the first eigenvector of $\mathbf{C}_x$. The solution is given by

$$\mathbf{w}_2 = \mathbf{e}_2$$

Likewise, recursively it follows that

$$\mathbf{w}_k = \mathbf{e}_k$$

Thus the $k$-th principal component is $y_k = \mathbf{e}_k^T \mathbf{x}$.

Exactly the same result for the $\mathbf{w}_i$ is obtained if the variances of $y_i$ are maximized under the constraint that the principal component vectors are orthonormal, or $\mathbf{w}_i^T \mathbf{w}_j = \delta_{ij}$. This is left as an exercise.
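The recursive construction, where each new weight vector maximizes the variance in the subspace orthogonal to the earlier ones, can be imitated with a simple power iteration that re-orthogonalizes against the vectors already found. This is a swapped-in numerical technique chosen for illustration, not a method taken from the text; the helper `leading_direction`, the iteration count, and the synthetic data are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic zero-mean data with well separated variances along rotated axes.
T, n, m = 3000, 5, 3
X = rng.normal(size=(T, n)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5])
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))     # random orthogonal rotation
X = X @ Q.T
X -= X.mean(axis=0)
C = X.T @ X / T

def leading_direction(C, previous, iters=500):
    """Power iteration restricted to the subspace orthogonal to `previous`."""
    w = rng.normal(size=C.shape[0])
    for _ in range(iters):
        w = C @ w
        for e in previous:                       # enforce w^T e_k = 0 for earlier k
            w -= (w @ e) * e
        w /= np.linalg.norm(w)                   # unit norm, as in (6.2)
    return w

W = []
for _ in range(m):
    W.append(leading_direction(C, W))
W = np.array(W).T                                # columns w_1, ..., w_m

# The resulting components are mutually uncorrelated, as required by (6.4).
Y = X @ W
C_y = Y.T @ Y / T
print(np.round(C_y, 4))
```

Up to sign, the columns of `W` agree with the eigenvectors of `C` belonging to the three largest eigenvalues, which is the result $\mathbf{w}_k = \mathbf{e}_k$ stated above.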

6.1.2 PCA by minimum mean-square error compression

In the preceding subsection, the principal components were defined as weighted sums of the elements of $\mathbf{x}$ with maximal variance, under the constraints that the weights are normalized and the principal components are uncorrelated with each other. It turns out that this is strongly related to minimum mean-square error compression of $\mathbf{x}$, which is another way to pose the PCA problem. Let us search for a set of $m$ orthonormal basis vectors, spanning an $m$-dimensional subspace, such that the mean-square error between $\mathbf{x}$ and its projection on the subspace is minimal. Denoting again the basis vectors by $\mathbf{w}_1, \ldots, \mathbf{w}_m$, for which we assume

$$\mathbf{w}_i^T \mathbf{w}_j = \delta_{ij}$$

the projection of $\mathbf{x}$ on the subspace spanned by them is $\sum_{i=1}^{m} (\mathbf{w}_i^T \mathbf{x}) \mathbf{w}_i$. The mean-square error (MSE) criterion, to be minimized by the orthonormal basis $\mathbf{w}_1, \ldots, \mathbf{w}_m$, becomes

$$J_{MSE}^{PCA} = E\Big\{\Big\|\mathbf{x} - \sum_{i=1}^{m} (\mathbf{w}_i^T \mathbf{x}) \mathbf{w}_i\Big\|^2\Big\} \tag{6.7}$$

It is easy to show (see exercises) that due to the orthogonality of the vectors $\mathbf{w}_i$, this criterion can be further written as

$$J_{MSE}^{PCA} = E\{\|\mathbf{x}\|^2\} - E\Big\{\sum_{j=1}^{m} (\mathbf{w}_j^T \mathbf{x})^2\Big\} \tag{6.8}$$

$$= \mathrm{trace}(\mathbf{C}_x) - \sum_{j=1}^{m} \mathbf{w}_j^T \mathbf{C}_x \mathbf{w}_j \tag{6.9}$$


It can be shown (see, e.g., [112]) that the minimum of (6.9) under the orthonormality condition on the $\mathbf{w}_i$ is given by any orthonormal basis of the PCA subspace spanned by the $m$ first eigenvectors $\mathbf{e}_1, \ldots, \mathbf{e}_m$. However, the criterion does not specify the basis of this subspace at all. Any orthonormal basis of the subspace will give the same optimal compression. While this ambiguity can be seen as a disadvantage, it should be noted that there may be some other criteria by which a certain basis in the PCA subspace is to be preferred over others. Independent component analysis is a prime example of methods in which PCA is a useful preprocessing step, but once the vector $\mathbf{x}$ has been expressed in terms of the first $m$ eigenvectors, a further rotation brings out the much more useful independent components.
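Both the equivalence of the forms (6.7) and (6.9) and the basis ambiguity of the PCA subspace can be illustrated on a sample. The following is a sketch under assumptions: synthetic data, sample averages in place of expectations, and hypothetical helper names `mse_direct` and `mse_trace`.

```python
import numpy as np

rng = np.random.default_rng(3)

T, n, m = 4000, 6, 3
X = rng.normal(size=(T, n)) @ rng.normal(size=(n, n))
X -= X.mean(axis=0)
C = X.T @ X / T

def mse_direct(W):
    """Sample version of (6.7): mean squared distance from x to its projection."""
    X_proj = (X @ W) @ W.T
    return np.mean(np.sum((X - X_proj) ** 2, axis=1))

def mse_trace(W):
    """Equivalent form (6.9): trace(C_x) - sum_j w_j^T C_x w_j."""
    return np.trace(C) - np.trace(W.T @ C @ W)

d, E = np.linalg.eigh(C)
W_eig = E[:, ::-1][:, :m]                 # the first m eigenvectors as columns

# Another orthonormal basis of the same subspace: rotate by a random orthogonal matrix.
R, _ = np.linalg.qr(rng.normal(size=(m, m)))
W_rot = W_eig @ R

# An orthonormal basis of an unrelated random subspace, for comparison.
W_other, _ = np.linalg.qr(rng.normal(size=(n, m)))

print("eigenvector basis:", mse_direct(W_eig), mse_trace(W_eig))
print("rotated basis:    ", mse_trace(W_rot))
print("random subspace:  ", mse_trace(W_other))
```

The eigenvector basis and its rotated version give the same error, while the random subspace gives a larger one, which is why a further criterion such as independence is needed to single out one basis.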

It can also be shown [112] that the value of the minimum mean-square error of (6.7) is

$$J_{MSE}^{PCA} = \sum_{i=m+1}^{n} d_i \tag{6.10}$$

the sum of the eigenvalues corresponding to the discarded eigenvectors $\mathbf{e}_{m+1}, \ldots, \mathbf{e}_n$. If the orthonormality constraint is simply changed to

$$\mathbf{w}_j^T \mathbf{w}_k = \omega_k \delta_{jk} \tag{6.11}$$

where all the numbers $\omega_k$ are positive and different, then the mean-square error problem will have a unique solution given by scaled eigenvectors [333].

6.1.3 Choosing the number of principal components

From the result that the principal component basis vectors $\mathbf{w}_i$ are eigenvectors $\mathbf{e}_i$ of $\mathbf{C}_x$, it follows that

$$E\{y_m^2\} = E\{\mathbf{e}_m^T \mathbf{x}\mathbf{x}^T \mathbf{e}_m\} = \mathbf{e}_m^T \mathbf{C}_x \mathbf{e}_m = d_m \tag{6.12}$$

The variances of the principal components are thus directly given by the eigenvalues of $\mathbf{C}_x$. Note that, because the principal components have zero means, a small eigenvalue (a small variance) $d_m$ indicates that the value of the corresponding principal component $y_m$ is mostly close to zero.

An important application of PCA is data compression. The vectors $\mathbf{x}$ in the original data set (that have first been centered by subtracting the mean) are approximated by the truncated PCA expansion

$$\hat{\mathbf{x}} = \sum_{i=1}^{m} y_i \mathbf{e}_i \tag{6.13}$$

Then we know from (6.10) that the mean-square error $E\{\|\mathbf{x} - \hat{\mathbf{x}}\|^2\}$ is equal to $\sum_{i=m+1}^{n} d_i$. As the eigenvalues of a covariance matrix are all nonnegative, the error does not increase when more terms are included in (6.13), and it becomes zero when $m = n$, that is, when all the principal components are included. A very important practical problem is how to choose $m$ in (6.13); this is a trade-off between error and the amount of data needed for the expansion. Sometimes a rather small number of principal components are sufficient.
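The relation between the truncated expansion (6.13) and the error formula (6.10) can be checked for every value of $m$. This is a small sketch on synthetic data; `truncation_error` is a hypothetical helper, and sample averages stand in for the expectations.

```python
import numpy as np

rng = np.random.default_rng(4)

T, n = 3000, 8
X = rng.normal(size=(T, n)) @ rng.normal(size=(n, n))
X -= X.mean(axis=0)
C = X.T @ X / T

d, E = np.linalg.eigh(C)
d, E = d[::-1], E[:, ::-1]                 # descending order d_1 >= ... >= d_n

def truncation_error(m):
    """Sample mean-square error of the truncated expansion (6.13) with m terms."""
    Y = X @ E[:, :m]                       # principal components y_1, ..., y_m
    X_hat = Y @ E[:, :m].T                 # x_hat = sum_i y_i e_i
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

errors = np.array([truncation_error(m) for m in range(n + 1)])
tail_sums = np.array([d[m:].sum() for m in range(n + 1)])   # sum over i > m of d_i

for m in range(n + 1):
    print(m, errors[m], tail_sums[m])
```

The two columns agree for each `m`, the error shrinks as terms are added, and it reaches zero (up to rounding) at `m = n`.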

Fig. 6.1 Leftmost column: some digital images in a $32 \times 32$ grid. Second column: means of the samples. Remaining columns: reconstructions by PCA when 1, 2, 5, 16, 32, and 64 principal components were used in the expansion.

Example 6.1 In digital image processing, the amount of data is typically very large, and data compression is necessary for storage, transmission, and feature extraction. PCA is a simple and efficient method. Fig. 6.1 shows 10 handwritten characters that were represented as binary $32 \times 32$ matrices (left column) [183]. Such images, when scanned row by row, can be represented as 1024-dimensional vectors. For each of the 10 character classes, about 1700 handwritten samples were collected, and the sample means and covariance matrices were computed by standard estimation methods. The covariance matrices were $1024 \times 1024$ matrices. For each class, the first 64 principal component vectors or eigenvectors of the covariance matrix were computed. The second column in Fig. 6.1 shows the sample means, and the other columns show the reconstructions (6.13) for various values of $m$. In the reconstructions, the sample means have been added again to scale the images for visual display. Note how a relatively small percentage of the 1024 principal components produces reasonable reconstructions.


The condition (6.12) can often be used in advance to determine the number of principal components $m$, if the eigenvalues are known. The eigenvalue sequence $d_1, d_2, \ldots, d_n$ of a covariance matrix for real-world measurement data is usually sharply decreasing, and it is possible to set a limit below which the eigenvalues, hence principal components, are insignificantly small. This limit determines how many principal components are used.
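One simple way to turn such a limit into a rule is to count the eigenvalues that exceed a fixed fraction of the largest one. The sketch below is only one possible convention: the function names, the relative threshold `rel_threshold`, and the eigenvalue sequence are all made up for illustration and do not come from the text.

```python
import numpy as np

def choose_m_by_threshold(eigenvalues, rel_threshold=0.01):
    """Count eigenvalues that are at least rel_threshold times the largest one.

    rel_threshold is a hypothetical tuning parameter, not a value from the text.
    """
    d = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    return int(np.sum(d >= rel_threshold * d[0]))

def retained_variance(eigenvalues, m):
    """Fraction of the total variance carried by the first m principal components."""
    d = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    return d[:m].sum() / d.sum()

# A sharply decreasing eigenvalue sequence (made-up numbers for illustration).
d = [50.0, 20.0, 5.0, 0.3, 0.2, 0.1]
m = choose_m_by_threshold(d, rel_threshold=0.05)
print("m =", m, " retained variance =", retained_variance(d, m))
```

By (6.10), the variance that is not retained is exactly the mean-square error of the truncated expansion, so the second helper reports how much of the trade-off a given `m` buys.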

Sometimes the threshold can be determined from some prior information on the vectors $\mathbf{x}$. For instance, assume that $\mathbf{x}$ obeys a signal-noise model

$$\mathbf{x} = \sum_{i=1}^{m} \mathbf{a}_i s_i + \mathbf{n} \tag{6.14}$$

where $m < n$. Here the $\mathbf{a}_i$ are some fixed vectors and the coefficients $s_i$ are random numbers that are zero mean and uncorrelated. We can assume that their variances have been absorbed in the vectors $\mathbf{a}_i$ so that they have unit variances. The term $\mathbf{n}$ is white noise, for which $E\{\mathbf{n}\mathbf{n}^T\} = \sigma^2 \mathbf{I}$. Then the vectors $\mathbf{a}_i$ span a subspace, called the signal subspace, that has lower dimensionality than the whole space of vectors $\mathbf{x}$. The subspace orthogonal to the signal subspace is spanned by pure noise and it is called the noise subspace.

It is easy to show (see exercises) that in this case the covariance matrix of $\mathbf{x}$ has a special form:

$$\mathbf{C}_x = \sum_{i=1}^{m} \mathbf{a}_i \mathbf{a}_i^T + \sigma^2 \mathbf{I} \tag{6.15}$$

The eigenvalues are now the eigenvalues of $\sum_{i=1}^{m} \mathbf{a}_i \mathbf{a}_i^T$, increased by the constant $\sigma^2$. But the matrix $\sum_{i=1}^{m} \mathbf{a}_i \mathbf{a}_i^T$ has at most $m$ nonzero eigenvalues, and these correspond to eigenvectors that span the signal subspace. When the eigenvalues of $\mathbf{C}_x$ are computed, the first $m$ form a decreasing sequence and the rest are small constants, equal to $\sigma^2$:

$$d_1 > d_2 > \cdots > d_m > d_{m+1} = d_{m+2} = \cdots = d_n = \sigma^2$$

It is usually possible to detect where the eigenvalues become constants, and putting a threshold at this index, $m$, cuts off the eigenvalues and eigenvectors corresponding to pure noise. Then only the signal part remains.
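The eigenvalue structure implied by (6.15) is easy to reproduce numerically. The sketch below builds the exact model covariance for randomly chosen vectors $\mathbf{a}_i$; the dimensions, the noise variance `sigma2`, and the tolerance used to detect the noise floor are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(5)

n, m, sigma2 = 6, 2, 0.1                   # illustrative dimensions and noise variance
A = rng.normal(size=(n, m))                # columns are the fixed vectors a_1, ..., a_m

# Covariance matrix of the signal-noise model, as given by Eq. (6.15).
C = A @ A.T + sigma2 * np.eye(n)

d = np.sort(np.linalg.eigvalsh(C))[::-1]   # eigenvalues in descending order
signal_eigs = np.sort(np.linalg.eigvalsh(A @ A.T))[::-1]

print("eigenvalues of C_x:", np.round(d, 4))

# Count the eigenvalues that clearly exceed the noise floor (the smallest eigenvalue).
m_detected = int(np.sum(d > d[-1] + 1e-8))
print("detected signal subspace dimension:", m_detected)
```

With an exact covariance matrix the noise eigenvalues are equal to `sigma2` up to rounding, so a tiny tolerance suffices. With a covariance matrix estimated from a finite sample they scatter around `sigma2`, and the tolerance would have to be chosen correspondingly larger.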

A more disciplined approach to this problem was given by [453]; see also [231]. They give formulas for two well-known information theoretic modeling criteria, Akaike's information criterion (AIC) and the minimum description length criterion (MDL), as functions of the signal subspace dimension $m$. The criteria depend on the length $T$ of the sample $\mathbf{x}(1), \ldots, \mathbf{x}(T)$ and on the eigenvalues $d_1, \ldots, d_n$ of the matrix $\mathbf{C}_x$. Finding the minimum point gives a good value for $m$.

6.1.4 Closed-form computation of PCA

To use the closed-form solution $\mathbf{w}_i = \mathbf{e}_i$ given earlier for the PCA basis vectors, the eigenvectors of the covariance matrix $\mathbf{C}_x$ must be known. In the conventional use of PCA, there is a sufficiently large sample of vectors $\mathbf{x}$ available, from which the mean and the covariance matrix $\mathbf{C}_x$ can be estimated by standard methods (see Chapter 4). Solving the eigenvector–eigenvalue problem for $\mathbf{C}_x$ gives the estimate for $\mathbf{e}_1$. There are several efficient numerical methods available for computing the eigenvectors, e.g., the QR algorithm with its variants [112, 153, 320].
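The batch procedure just described (estimate the mean and covariance, then solve the eigenproblem) fits in a few lines. This sketch assumes NumPy and uses `numpy.linalg.eigh`, a general symmetric eigensolver, as a stand-in for the numerical methods cited above; the function name `pca_closed_form` and the $1/T$ covariance normalization are choices made for the example.

```python
import numpy as np

def pca_closed_form(X, m):
    """Estimate the first m PCA basis vectors from a sample.

    X : array of shape (T, n), one observed vector x(t) per row.
    Returns (mean, d, E): the sample mean, the m largest eigenvalues in
    descending order, and the matching unit-length eigenvectors as columns.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean
    C = Xc.T @ Xc / X.shape[0]             # sample covariance (1/T normalization)
    d, E = np.linalg.eigh(C)               # symmetric eigensolver, ascending order
    order = np.argsort(d)[::-1][:m]
    return mean, d[order], E[:, order]

rng = np.random.default_rng(6)
X = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 5)) + 2.0
mean, d, E = pca_closed_form(X, m=2)

Y = (X - mean) @ E                         # principal components of each sample
X_hat = Y @ E.T + mean                     # reconstruction with the mean added back
print(d, np.var(Y, axis=0))
```

The printed variances of the two components match the returned eigenvalues, in line with (6.12).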

However, it is not always feasible to compute the eigenvectors by standard numerical methods. In an on-line data compression application like image or speech coding, the data samples $\mathbf{x}(t)$ arrive at high speed, and it may not be possible to estimate the covariance matrix and solve the eigenvector–eigenvalue problem once and for all. One reason is computational: the eigenvector problem is numerically too demanding if the dimensionality $n$ is large and the sampling rate is high. Another reason is that the covariance matrix $\mathbf{C}_x$ may not be stationary, due to fluctuating statistics in the sample sequence $\mathbf{x}(t)$, so the estimate would have to be incrementally updated. Therefore, the PCA solution is often replaced by suboptimal nonadaptive transformations like the discrete cosine transform [154].
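To make the idea of an incrementally updated covariance estimate concrete, the following sketch keeps exponentially weighted running estimates of the mean and covariance. It is one possible scheme, assumed here for illustration and not taken from the text; the class name `RunningCovariance` and the forgetting factor `alpha` are hypothetical.

```python
import numpy as np

class RunningCovariance:
    """Exponentially weighted running estimates of the mean and covariance.

    The forgetting factor `alpha` is a hypothetical tuning parameter: larger
    values track changing statistics faster but give noisier estimates.
    """

    def __init__(self, n, alpha=0.01):
        self.alpha = alpha
        self.mean = np.zeros(n)
        self.cov = np.eye(n)               # arbitrary starting value

    def update(self, x):
        """Blend one new sample x(t) into the current estimates."""
        x = np.asarray(x, dtype=float)
        self.mean += self.alpha * (x - self.mean)
        xc = x - self.mean
        self.cov += self.alpha * (np.outer(xc, xc) - self.cov)
        return self.cov

rng = np.random.default_rng(7)
true_std = np.array([3.0, 1.0])            # true variances are 9 and 1
est = RunningCovariance(n=2, alpha=0.005)
for _ in range(20000):
    est.update(rng.normal(size=2) * true_std)

d = np.sort(np.linalg.eigvalsh(est.cov))[::-1]
print(np.round(d, 2))
```

Each update costs only on the order of $n^2$ operations, but the eigenvectors of the running estimate still have to be recomputed whenever they are needed, which is the computational burden the paragraph above points to.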

From the document Independent Component Analysis (pages 147-154)
