MATHEMATICAL PRELIMINARIES
6 Principal Component Analysis and Whitening
Principal component analysis (PCA) and the closely related Karhunen-Loève transform, or the Hotelling transform, are classic techniques in statistical data analysis, feature extraction, and data compression, stemming from the early work of Pearson [364]. Given a set of multivariate measurements, the purpose is to find a smaller set of variables with less redundancy that gives as good a representation as possible.
This goal is related to the goal of independent component analysis (ICA). However, in PCA the redundancy is measured by correlations between data elements, while in ICA the much richer concept of independence is used, and in ICA the reduction of the number of variables is given less emphasis. Using only the correlations as in PCA has the advantage that the analysis can be based on second-order statistics only.
In connection with ICA, PCA is a useful preprocessing step.
The basic PCA problem is outlined in this chapter. Both the closed-form solution and on-line learning algorithms for PCA are reviewed. Next, the related linear statistical technique of factor analysis is discussed. The chapter concludes by presenting how data can be preprocessed by whitening, which removes the effect of first- and second-order statistics and is very helpful as the first step in ICA.
6.1 PRINCIPAL COMPONENTS
... model is assumed for the vector $\mathbf{x}$. Typically the elements of $\mathbf{x}$ are measurements like pixel gray levels or values of a signal at different time instants. It is essential in PCA that the elements are mutually correlated, and there is thus some redundancy in $\mathbf{x}$, making compression possible. If the elements are independent, nothing can be achieved by PCA.

In the PCA transform, the vector $\mathbf{x}$ is first centered by subtracting its mean:

$$\mathbf{x} \leftarrow \mathbf{x} - \mathrm{E}\{\mathbf{x}\}$$

The mean is in practice estimated from the available sample $\mathbf{x}(1), \ldots, \mathbf{x}(T)$ (see Chapter 4). Let us assume in the following that the centering has been done and thus $\mathrm{E}\{\mathbf{x}\} = \mathbf{0}$. Next, $\mathbf{x}$ is linearly transformed to another vector $\mathbf{y}$ with $m$ elements, $m < n$, where $n$ is the number of elements of $\mathbf{x}$, so that the redundancy induced by the correlations is removed. This is done by finding a rotated orthogonal coordinate system such that the elements of $\mathbf{x}$ in the new coordinates become uncorrelated. At the same time, the variances of the projections of $\mathbf{x}$ on the new coordinate axes are maximized so that the first axis corresponds to the maximal variance, the second axis corresponds to the maximal variance in the direction orthogonal to the first axis, and so on.

For instance, if $\mathbf{x}$ has a gaussian density that is constant over ellipsoidal surfaces in the $n$-dimensional space, then the rotated coordinate system coincides with the principal axes of the ellipsoid. A two-dimensional example is shown in Fig. 2.7 in Chapter 2. The principal components are now the projections of the data points on the two principal axes, $\mathbf{e}_1$ and $\mathbf{e}_2$. In addition to achieving uncorrelated components, the variances of the components (projections) also will be very different in most applications, with a considerable number of the variances so small that the corresponding components can be discarded altogether. Those components that are left constitute the vector $\mathbf{y}$.

As an example, take a set of $8 \times 8$ pixel windows from a digital image, an application that is considered in detail in Chapter 21. They are first transformed, e.g., using row-by-row scanning, into vectors $\mathbf{x}$ whose elements are the gray levels of the 64 pixels in the window. In real-time digital video transmission, it is essential to reduce this data as much as possible without losing too much of the visual quality, because the total amount of data is very large. Using PCA, a compressed representation vector $\mathbf{y}$ can be obtained from $\mathbf{x}$, which can be stored or transmitted. Typically, $\mathbf{y}$ can have as few as 10 elements, and a good replica of the original $8 \times 8$ image window can still be reconstructed from it. This kind of compression is possible because neighboring elements of $\mathbf{x}$, which are the gray levels of neighboring pixels in the digital image, are heavily correlated. These correlations are utilized by PCA, allowing almost the same information to be represented by a much smaller vector $\mathbf{y}$. PCA is a linear technique, so computing $\mathbf{y}$ from $\mathbf{x}$ is not heavy, which makes real-time processing possible.
6.1.1 PCA by variance maximization
In mathematical terms, consider a linear combination

$$y_1 = \sum_{k=1}^{n} w_{k1} x_k = \mathbf{w}_1^T \mathbf{x}$$

of the elements $x_1, \ldots, x_n$ of the vector $\mathbf{x}$. The $w_{11}, \ldots, w_{n1}$ are scalar coefficients or weights, elements of an $n$-dimensional vector $\mathbf{w}_1$, and $\mathbf{w}_1^T$ denotes the transpose of $\mathbf{w}_1$.

The factor $y_1$ is called the first principal component of $\mathbf{x}$ if the variance of $y_1$ is maximally large. Because the variance depends on both the norm and orientation of the weight vector $\mathbf{w}_1$ and grows without limits as the norm grows, we impose the constraint that the norm of $\mathbf{w}_1$ is constant, in practice equal to 1. Thus we look for a weight vector $\mathbf{w}_1$ maximizing the PCA criterion

$$J_1^{\mathrm{PCA}}(\mathbf{w}_1) = \mathrm{E}\{y_1^2\} = \mathrm{E}\{(\mathbf{w}_1^T \mathbf{x})^2\} = \mathbf{w}_1^T \mathrm{E}\{\mathbf{x}\mathbf{x}^T\} \mathbf{w}_1 = \mathbf{w}_1^T \mathbf{C}_x \mathbf{w}_1 \tag{6.1}$$

so that

$$\|\mathbf{w}_1\| = 1 \tag{6.2}$$

Here $\mathrm{E}\{\cdot\}$ is the expectation over the (unknown) density of the input vector $\mathbf{x}$, and the norm of $\mathbf{w}_1$ is the usual Euclidean norm defined as

$$\|\mathbf{w}_1\| = (\mathbf{w}_1^T \mathbf{w}_1)^{1/2} = \Big[\sum_{k=1}^{n} w_{k1}^2\Big]^{1/2}$$

The matrix
$\mathbf{C}_x$ in Eq. (6.1) is the $n \times n$ covariance matrix of $\mathbf{x}$ (see Chapter 4), given for the zero-mean vector $\mathbf{x}$ by the correlation matrix

$$\mathbf{C}_x = \mathrm{E}\{\mathbf{x}\mathbf{x}^T\} \tag{6.3}$$

It is well known from basic linear algebra (see, e.g., [324, 112]) that the solution to the PCA problem is given in terms of the unit-length eigenvectors $\mathbf{e}_1, \ldots, \mathbf{e}_n$ of the matrix $\mathbf{C}_x$. The ordering of the eigenvectors is such that the corresponding eigenvalues $d_1, \ldots, d_n$ satisfy $d_1 \geq d_2 \geq \cdots \geq d_n$. The solution maximizing (6.1) is given by

$$\mathbf{w}_1 = \mathbf{e}_1$$

Thus the first principal component of $\mathbf{x}$ is $y_1 = \mathbf{e}_1^T \mathbf{x}$.

The criterion
$J_1^{\mathrm{PCA}}$ in Eq. (6.1) can be generalized to $m$ principal components, with $m$ any number between 1 and $n$. Denoting the $m$-th ($1 \leq m \leq n$) principal component by $y_m = \mathbf{w}_m^T \mathbf{x}$, with $\mathbf{w}_m$ the corresponding unit-norm weight vector, the variance of $y_m$ is now maximized under the constraint that $y_m$ is uncorrelated with all the previously found principal components:

$$\mathrm{E}\{y_m y_k\} = 0, \quad k < m. \tag{6.4}$$

Note that the principal components $y_m$ have zero means because

$$\mathrm{E}\{y_m\} = \mathbf{w}_m^T \mathrm{E}\{\mathbf{x}\} = 0$$

The condition (6.4) yields:
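$$\mathrm{E}\{y_m y_k\} = \mathrm{E}\{(\mathbf{w}_m^T \mathbf{x})(\mathbf{w}_k^T \mathbf{x})\} = \mathbf{w}_m^T \mathbf{C}_x \mathbf{w}_k = 0 \tag{6.5}$$

For the second principal component, we have the condition that

$$\mathbf{w}_2^T \mathbf{C}_x \mathbf{w}_1 = d_1 \mathbf{w}_2^T \mathbf{e}_1 = 0 \tag{6.6}$$

because we already know that $\mathbf{w}_1 = \mathbf{e}_1$, and hence $\mathbf{C}_x \mathbf{w}_1 = d_1 \mathbf{e}_1$. We are thus looking for maximal variance $\mathrm{E}\{y_2^2\} = \mathrm{E}\{(\mathbf{w}_2^T \mathbf{x})^2\}$ in the subspace orthogonal to the first eigenvector of $\mathbf{C}_x$. The solution is given by

$$\mathbf{w}_2 = \mathbf{e}_2$$

Likewise, recursively it follows that

$$\mathbf{w}_k = \mathbf{e}_k$$

Thus the $k$th principal component is $y_k = \mathbf{e}_k^T \mathbf{x}$.

Exactly the same result for the $\mathbf{w}_i$ is obtained if the variances of $y_i$ are maximized under the constraint that the principal component vectors are orthonormal, or $\mathbf{w}_i^T \mathbf{w}_j = \delta_{ij}$. This is left as an exercise.

6.1.2 PCA by minimum mean-square error compression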
In the preceding subsection, the principal components were defined as weighted sums of the elements of $\mathbf{x}$ with maximal variance, under the constraints that the weights are normalized and the principal components are uncorrelated with each other. It turns out that this is strongly related to minimum mean-square error compression of $\mathbf{x}$, which is another way to pose the PCA problem. Let us search for a set of $m$ orthonormal basis vectors, spanning an $m$-dimensional subspace, such that the mean-square error between $\mathbf{x}$ and its projection on the subspace is minimal. Denoting again the basis vectors by $\mathbf{w}_1, \ldots, \mathbf{w}_m$, for which we assume

$$\mathbf{w}_i^T \mathbf{w}_j = \delta_{ij}$$

the projection of $\mathbf{x}$ on the subspace spanned by them is $\sum_{i=1}^{m} (\mathbf{w}_i^T \mathbf{x}) \mathbf{w}_i$. The mean-square error (MSE) criterion, to be minimized by the orthonormal basis $\mathbf{w}_1, \ldots, \mathbf{w}_m$, becomes

$$J_{\mathrm{MSE}}^{\mathrm{PCA}} = \mathrm{E}\Big\{\Big\|\mathbf{x} - \sum_{i=1}^{m} (\mathbf{w}_i^T \mathbf{x}) \mathbf{w}_i\Big\|^2\Big\} \tag{6.7}$$

It is easy to show (see exercises) that due to the orthonormality of the vectors $\mathbf{w}_i$, this criterion can be further written as

$$J_{\mathrm{MSE}}^{\mathrm{PCA}} = \mathrm{E}\{\|\mathbf{x}\|^2\} - \mathrm{E}\Big\{\sum_{j=1}^{m} (\mathbf{w}_j^T \mathbf{x})^2\Big\} \tag{6.8}$$

$$= \mathrm{trace}(\mathbf{C}_x) - \sum_{j=1}^{m} \mathbf{w}_j^T \mathbf{C}_x \mathbf{w}_j \tag{6.9}$$
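The step from (6.7) to (6.9), which the text leaves to the exercises, can be sketched as follows. This is one possible route, using only the orthonormality $\mathbf{w}_i^T \mathbf{w}_j = \delta_{ij}$ and the zero mean of $\mathbf{x}$. Expanding the squared norm inside (6.7) gives

$$
\begin{aligned}
\Big\|\mathbf{x} - \sum_{i=1}^{m} (\mathbf{w}_i^T \mathbf{x}) \mathbf{w}_i\Big\|^2
&= \|\mathbf{x}\|^2 - 2 \sum_{i=1}^{m} (\mathbf{w}_i^T \mathbf{x})^2
 + \sum_{i=1}^{m} \sum_{j=1}^{m} (\mathbf{w}_i^T \mathbf{x})(\mathbf{w}_j^T \mathbf{x})\, \mathbf{w}_i^T \mathbf{w}_j \\
&= \|\mathbf{x}\|^2 - \sum_{j=1}^{m} (\mathbf{w}_j^T \mathbf{x})^2 ,
\end{aligned}
$$

because the double sum collapses to $\sum_{j} (\mathbf{w}_j^T \mathbf{x})^2$ when $\mathbf{w}_i^T \mathbf{w}_j = \delta_{ij}$. Taking expectations yields (6.8). For (6.9), note that

$$
\mathrm{E}\{\|\mathbf{x}\|^2\} = \mathrm{E}\{\mathrm{trace}(\mathbf{x}\mathbf{x}^T)\} = \mathrm{trace}(\mathbf{C}_x),
\qquad
\mathrm{E}\{(\mathbf{w}_j^T \mathbf{x})^2\} = \mathbf{w}_j^T \mathrm{E}\{\mathbf{x}\mathbf{x}^T\} \mathbf{w}_j = \mathbf{w}_j^T \mathbf{C}_x \mathbf{w}_j .
$$

Minimizing the MSE is therefore the same as maximizing the total variance $\sum_{j} \mathbf{w}_j^T \mathbf{C}_x \mathbf{w}_j$ captured by the projections, which is the link to the variance criterion (6.1).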
It can be shown (see, e.g., [112]) that the minimum of (6.9) under the orthonormality condition on the $\mathbf{w}_i$ is given by any orthonormal basis of the PCA subspace spanned by the $m$ first eigenvectors $\mathbf{e}_1, \ldots, \mathbf{e}_m$. However, the criterion does not specify the basis of this subspace at all. Any orthonormal basis of the subspace will give the same optimal compression. While this ambiguity can be seen as a disadvantage, it should be noted that there may be some other criteria by which a certain basis in the PCA subspace is to be preferred over others. Independent component analysis is a prime example of methods in which PCA is a useful preprocessing step, but once the vector $\mathbf{x}$ has been expressed in terms of the first $m$ eigenvectors, a further rotation brings out the much more useful independent components.

It can also be shown [112] that the value of the minimum mean-square error of (6.7) is

$$J_{\mathrm{MSE}}^{\mathrm{PCA}} = \sum_{i=m+1}^{n} d_i \tag{6.10}$$

the sum of the eigenvalues corresponding to the discarded eigenvectors $\mathbf{e}_{m+1}, \ldots, \mathbf{e}_n$. If the orthonormality constraint is simply changed to

$$\mathbf{w}_j^T \mathbf{w}_k = \omega_k \delta_{jk} \tag{6.11}$$

where all the numbers $\omega_k$ are positive and different, then the mean-square error problem will have a unique solution given by scaled eigenvectors [333].

6.1.3 Choosing the number of principal components
From the result that the principal component basis vectors $\mathbf{w}_i$ are eigenvectors $\mathbf{e}_i$ of $\mathbf{C}_x$, it follows that

$$\mathrm{E}\{y_m^2\} = \mathrm{E}\{\mathbf{e}_m^T \mathbf{x}\mathbf{x}^T \mathbf{e}_m\} = \mathbf{e}_m^T \mathbf{C}_x \mathbf{e}_m = d_m \tag{6.12}$$

The variances of the principal components are thus directly given by the eigenvalues of $\mathbf{C}_x$. Note that, because the principal components have zero means, a small eigenvalue (a small variance) $d_m$ indicates that the value of the corresponding principal component $y_m$ is mostly close to zero.

An important application of PCA is data compression. The vectors $\mathbf{x}$ in the original data set (that have first been centered by subtracting the mean) are approximated by the truncated PCA expansion

$$\hat{\mathbf{x}} = \sum_{i=1}^{m} y_i \mathbf{e}_i \tag{6.13}$$

Then we know from (6.10) that the mean-square error $\mathrm{E}\{\|\mathbf{x} - \hat{\mathbf{x}}\|^2\}$ is equal to $\sum_{i=m+1}^{n} d_i$. As the eigenvalues are all nonnegative, the error does not increase when more terms are included in (6.13), and it becomes zero when $m = n$, that is, when all the principal components are included. A very important practical problem is how to choose $m$ in (6.13); this is a trade-off between error and the amount of data needed for the expansion. Sometimes a rather small number of principal components are sufficient.

Fig. 6.1 Leftmost column: some digital images in a $32 \times 32$ grid. Second column: means of the samples. Remaining columns: reconstructions by PCA when 1, 2, 5, 16, 32, and 64 principal components were used in the expansion.
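The relation between the truncated expansion (6.13) and the error (6.10) can be checked numerically. The sketch below is illustrative only: the data are synthetic, the sizes `n`, `T`, and `m` are arbitrary choices, and the covariance is estimated with a 1/T normalization so that the identity holds exactly on the sample.

```python
import numpy as np

rng = np.random.default_rng(1)
n, T, m = 6, 5000, 2               # arbitrary illustrative sizes

# Synthetic correlated sample, one vector x(t) per row, then centered.
X = rng.normal(size=(T, n)) @ rng.normal(size=(n, n))
X = X - X.mean(axis=0)

C = X.T @ X / T                    # sample covariance C_x
d, E = np.linalg.eigh(C)           # ascending eigenvalues
d, E = d[::-1], E[:, ::-1]         # reorder: d_1 >= ... >= d_n

Y = X @ E[:, :m]                   # principal components y_i = e_i^T x
X_hat = Y @ E[:, :m].T             # truncated expansion (6.13)

mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))  # sample E{||x - x_hat||^2}
discarded = d[m:].sum()                          # d_{m+1} + ... + d_n

print("mean-square error:          ", mse)
print("sum of discarded eigenvalues:", discarded)
```

Up to rounding, the two printed numbers agree, as (6.10) predicts.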
Example 6.1 In digital image processing, the amount of data is typically very large, and data compression is necessary for storage, transmission, and feature extraction. PCA is a simple and efficient method. Fig. 6.1 shows 10 handwritten characters that were represented as binary $32 \times 32$ matrices (left column) [183]. Such images, when scanned row by row, can be represented as 1024-dimensional vectors. For each of the 10 character classes, about 1700 handwritten samples were collected, and the sample means and covariance matrices were computed by standard estimation methods. The covariance matrices were $1024 \times 1024$ matrices. For each class, the first 64 principal component vectors or eigenvectors of the covariance matrix were computed. The second column in Fig. 6.1 shows the sample means, and the other columns show the reconstructions (6.13) for various values of $m$. In the reconstructions, the sample means have been added again to scale the images for visual display. Note how a relatively small percentage of the 1024 principal components produces reasonable reconstructions.
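One simple way to make the trade-off between error and the number of components concrete is to pick the smallest $m$ whose relative mean-square error, the sum of the discarded eigenvalues divided by the sum of all eigenvalues, stays below a tolerance. The sketch below is only one possible heuristic and is not a rule given in the text; the function name `choose_m`, the parameter `max_relative_error`, and the example eigenvalues are hypothetical.

```python
import numpy as np

def choose_m(eigenvalues, max_relative_error):
    """Return the smallest m such that the sum of the discarded
    eigenvalues d_{m+1} + ... + d_n is at most
    max_relative_error times the sum of all eigenvalues."""
    d = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]  # descending
    total = d.sum()
    # tail[m] = d_{m+1} + ... + d_n for m = 0, ..., n (tail[n] = 0)
    tail = np.concatenate((np.cumsum(d[::-1])[::-1], [0.0]))
    admissible = np.nonzero(tail <= max_relative_error * total)[0]
    return int(admissible[0])

# Hypothetical eigenvalue sequence, for illustration only.
d_example = [5.0, 3.0, 1.0, 0.5, 0.5]
print(choose_m(d_example, 0.25))   # tolerate 25% relative error
print(choose_m(d_example, 0.15))   # tolerate 15% relative error
```

A smaller tolerance can only increase the returned $m$, which mirrors the trade-off described above.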
The result (6.12) can often be used in advance to determine the number of principal components $m$, if the eigenvalues are known. The eigenvalue sequence $d_1, d_2, \ldots, d_n$ of a covariance matrix for real-world measurement data is usually sharply decreasing, and it is possible to set a limit below which the eigenvalues, hence principal components, are insignificantly small. This limit determines how many principal components are used.

Sometimes the threshold can be determined from some prior information on the vectors $\mathbf{x}$. For instance, assume that $\mathbf{x}$ obeys a signal-noise model

$$\mathbf{x} = \sum_{i=1}^{m} \mathbf{a}_i s_i + \mathbf{n} \tag{6.14}$$

where $m < n$. There $\mathbf{a}_i$ are some fixed vectors and the coefficients $s_i$ are random numbers that are zero mean and uncorrelated. We can assume that their variances have been absorbed in vectors $\mathbf{a}_i$ so that they have unit variances. The term $\mathbf{n}$ is white noise, for which $\mathrm{E}\{\mathbf{n}\mathbf{n}^T\} = \sigma^2 \mathbf{I}$. Then the vectors $\mathbf{a}_i$ span a subspace, called the signal subspace, that has lower dimensionality than the whole space of vectors $\mathbf{x}$. The subspace orthogonal to the signal subspace is spanned by pure noise and it is called the noise subspace.

It is easy to show (see exercises) that in this case the covariance matrix of $\mathbf{x}$ has a special form:
$$\mathbf{C}_x = \sum_{i=1}^{m} \mathbf{a}_i \mathbf{a}_i^T + \sigma^2 \mathbf{I} \tag{6.15}$$

The eigenvalues are now the eigenvalues of $\sum_{i=1}^{m} \mathbf{a}_i \mathbf{a}_i^T$, increased by the constant $\sigma^2$. But the matrix $\sum_{i=1}^{m} \mathbf{a}_i \mathbf{a}_i^T$ has at most $m$ nonzero eigenvalues, and these correspond to eigenvectors that span the signal subspace. When the eigenvalues of $\mathbf{C}_x$ are computed, the first $m$ form a decreasing sequence and the rest are small constants, equal to $\sigma^2$:

$$d_1 > d_2 > \cdots > d_m > d_{m+1} = d_{m+2} = \cdots = d_n = \sigma^2$$

It is usually possible to detect where the eigenvalues become constants, and putting a threshold at this index, $m$, cuts off the eigenvalues and eigenvectors corresponding to pure noise. Then only the signal part remains.

A more disciplined approach to this problem was given by [453]; see also [231].
They give formulas for two well-known information-theoretic modeling criteria, Akaike's information criterion (AIC) and the minimum description length criterion (MDL), as functions of the signal subspace dimension $m$. The criteria depend on the length $T$ of the sample $\mathbf{x}(1), \ldots, \mathbf{x}(T)$ and on the eigenvalues $d_1, \ldots, d_n$ of the matrix $\mathbf{C}_x$. Finding the minimum point gives a good value for $m$.

6.1.4 Closed-form computation of PCA
To use the closed-form solution $\mathbf{w}_i = \mathbf{e}_i$ given earlier for the PCA basis vectors, the eigenvectors of the covariance matrix $\mathbf{C}_x$ must be known. In the conventional use of PCA, there is a sufficiently large sample of vectors $\mathbf{x}$ available, from which the mean and the covariance matrix $\mathbf{C}_x$ can be estimated by standard methods (see Chapter 4). Solving the eigenvector–eigenvalue problem for $\mathbf{C}_x$ gives the estimate for $\mathbf{e}_1$. There are several efficient numerical methods available for computing the eigenvectors, e.g., the QR algorithm with its variants [112, 153, 320].
However, it is not always feasible to solve the eigenvectors by standard numerical methods. In an on-line data compression application like image or speech coding, the data samples $\mathbf{x}(t)$ arrive at high speed, and it may not be possible to estimate the covariance matrix and solve the eigenvector–eigenvalue problem once and for all.

One reason is computational: the eigenvector problem is numerically too demanding if the dimensionality