Recently, Miao and Hua [300] proposed such a criterion: again denoting by $\mathbf{W} = (\mathbf{w}_1 \ldots \mathbf{w}_m)^T$ the matrix whose rows are the weight vectors, this “novel information criterion” (NIC) is
are generally not assumed to be equal or infinitely small, as in the special case of principal FA. We can write the covariance matrix of the observations from (6.30) as
$$\mathrm{E}\{\mathbf{x}\mathbf{x}^T\} = \mathbf{C}_x = \mathbf{A}\mathbf{A}^T + \mathbf{Q} \tag{6.32}$$
In practice, we have a good estimate of $\mathbf{C}_x$ available, given by the sample covariance matrix. The main problem is then to solve for the matrix $\mathbf{A}$ of factor loadings and the diagonal noise covariance matrix $\mathbf{Q}$ such that they explain the observed covariances from (6.32). There is no closed-form analytic solution for $\mathbf{A}$ and $\mathbf{Q}$.
Assuming $\mathbf{Q}$ is known or can be estimated, we can attempt to solve for $\mathbf{A}$ from $\mathbf{A}\mathbf{A}^T = \mathbf{C}_x - \mathbf{Q}$. The number of factors is usually constrained to be much smaller than the number of dimensions in the data, so this equation cannot be solved exactly; something similar to a least-squares solution should be used instead. Clearly, this problem does not have a unique solution: any orthogonal transform or rotation $\mathbf{A} \rightarrow \mathbf{A}\mathbf{T}$, with $\mathbf{T}$ an orthogonal matrix (for which $\mathbf{T}\mathbf{T}^T = \mathbf{I}$), will produce exactly the same left-hand side. We need some extra constraints to make the solution unique.
Now, looking for a factor-based interpretation of the observed variables, FA typically tries to solve for the matrix
$\mathbf{A}$ in such a way that the variables would have high loadings on a small number of factors, and very low loadings on the remaining factors. The results are then easier to interpret. This principle has been used in such techniques as varimax, quartimax, and oblimin rotations. Several classic techniques for such factor rotations are covered by Harman [166].
There are some important differences between PCA, FA, and ICA. Principal component analysis is not based on a generative model, although it can be derived from one. It is a linear transformation that is based either on variance maximization or minimum mean-square error representation. The PCA model is invertible in the (theoretical) case of no compression, i.e., when all the principal components are retained. Once the principal components $y_i$ have been found, the original observations can be readily expressed as their linear functions as $\mathbf{x} = \sum_{i=1}^{n} y_i \mathbf{w}_i$, and also the principal components are simply obtained as linear functions of the observations: $y_i = \mathbf{w}_i^T \mathbf{x}$.
The FA model is a generative latent variable model; the observations are expressed in terms of the factors, but the values of the factors cannot be directly computed from the observations. This is due to the additive term of specific factors or noise which is considered important in some application fields. Further, the rows of matrix $\mathbf{A}$ are generally not (proportional to) eigenvectors of $\mathbf{C}_x$; several different estimation methods exist.
FA, as well as PCA, is a purely second-order statistical method: only covariances between the observed variables are used in the estimation, which is due to the assumption of gaussianity of the factors. The factors are further assumed to be uncorrelated, which also implies independence in the case of gaussian data. ICA is a similar generative latent variable model, but now the factors or independent components are assumed to be statistically independent and nongaussian, a much stronger assumption that removes the rotational redundancy of the FA model. In fact, ICA can be considered as one particular method of determining the factor rotation.
The noise term is usually omitted in the ICA model; see Chapter 15 for a detailed discussion on this point.
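The rotational indeterminacy of the FA model discussed in this section can be checked numerically. The following is a minimal NumPy sketch, not taken from the original text; the dimensions, the random seed, and the names `A`, `Q`, and `T` are illustrative assumptions. It builds a model covariance $\mathbf{A}\mathbf{A}^T + \mathbf{Q}$ as in (6.32) and compares it with the covariance obtained after replacing $\mathbf{A}$ by $\mathbf{A}\mathbf{T}$ for an orthogonal $\mathbf{T}$.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 5, 2                                   # assumed sizes: 5 observed variables, 2 factors
A = rng.normal(size=(n, m))                   # hypothetical factor loading matrix
Q = np.diag(rng.uniform(0.1, 0.5, size=n))    # diagonal noise covariance

# Random orthogonal T, taken from the QR decomposition of a random square matrix.
T, _ = np.linalg.qr(rng.normal(size=(m, m)))

C_original = A @ A.T + Q                      # model covariance, Eq. (6.32)
C_rotated = (A @ T) @ (A @ T).T + Q           # same model with rotated loadings

# Largest entrywise difference; it vanishes up to rounding error.
max_difference = np.max(np.abs(C_original - C_rotated))
```

Because the two covariances agree, second-order statistics alone cannot distinguish $\mathbf{A}$ from $\mathbf{A}\mathbf{T}$, which is why extra criteria such as varimax, or the nongaussianity assumption of ICA, are needed to fix the rotation.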
6.4 WHITENING
As already discussed in Chapter 1, the ICA problem is greatly simplified if the observed mixture vectors are first whitened or sphered. A zero-mean random vector $\mathbf{z} = (z_1 \ldots z_n)^T$ is said to be white if its elements $z_i$ are uncorrelated and have unit variances:
$$\mathrm{E}\{z_i z_j\} = \delta_{ij}$$
In terms of the covariance matrix, this obviously means that $\mathrm{E}\{\mathbf{z}\mathbf{z}^T\} = \mathbf{I}$, with $\mathbf{I}$ the unit matrix. The best-known example is white noise; then the elements $z_i$ would be the intensities of noise at consecutive time points $i = 1, 2, \ldots$ and there are no temporal correlations in the noise process. The term “white” comes from the fact that the power spectrum of white noise is constant over all frequencies, somewhat like the spectrum of white light, which contains all colors.
A synonym for white is sphered. If the density of the vector
$\mathbf{z}$ is radially symmetric and suitably scaled, then it is sphered. An example is the multivariate gaussian density that has zero mean and unit covariance matrix. The converse does not hold: the density of a sphered vector does not have to be radially symmetric. An example is a two-dimensional uniform density that has the shape of a rotated square; see Fig. 7.10. It is easy to see that in this case both the variables $z_1$ and $z_2$ on the coordinate axes have unit variance (if the side of the square has length $2\sqrt{3}$) and they are uncorrelated, independently of the rotation angle. Thus vector $\mathbf{z}$ is sphered, even if the density is highly nonsymmetric. Note that the densities of the elements $z_i$ of a sphered random vector need not be the same.
Because whitening is essentially decorrelation followed by scaling, the technique of PCA can be used. This implies that whitening can be done with a linear operation.
The problem of whitening is now: Given a random vector $\mathbf{x}$ with $n$ elements, find a linear transformation $\mathbf{V}$ into another vector $\mathbf{z}$ such that $\mathbf{z} = \mathbf{V}\mathbf{x}$ is white (sphered).
The problem has a straightforward solution in terms of the PCA expansion. Let $\mathbf{E} = (\mathbf{e}_1 \ldots \mathbf{e}_n)$ be the matrix whose columns are the unit-norm eigenvectors of the covariance matrix $\mathbf{C}_x = \mathrm{E}\{\mathbf{x}\mathbf{x}^T\}$. These can be computed from a sample of the vectors $\mathbf{x}$ either directly or by one of the on-line PCA learning rules. Let $\mathbf{D} = \mathrm{diag}(d_1 \ldots d_n)$ be the diagonal matrix of the eigenvalues of $\mathbf{C}_x$. Then a linear whitening transform is given by
$$\mathbf{V} = \mathbf{D}^{-1/2}\mathbf{E}^T \tag{6.33}$$
This matrix always exists when the eigenvalues $d_i$ are positive; in practice, this is not a restriction. Remember (see Chapter 4) that $\mathbf{C}_x$ is positive semidefinite, in practice positive definite for almost any natural data, so its eigenvalues will be positive.
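The whitening transform of Eq. (6.33) can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not code from the original text: the function name `pca_whitening_matrix`, the hypothetical mixing matrix `mixing`, and the sample size are made up for the example, and the covariance is estimated from the sample after removing the sample mean.

```python
import numpy as np

def pca_whitening_matrix(X):
    """Return V = D^{-1/2} E^T estimated from the rows of X (one sample per row)."""
    X_centered = X - X.mean(axis=0)
    C = X_centered.T @ X_centered / X_centered.shape[0]   # sample covariance matrix
    d, E = np.linalg.eigh(C)      # eigenvalues d, unit-norm eigenvectors as columns of E
    return np.diag(d ** -0.5) @ E.T

rng = np.random.default_rng(0)
mixing = np.array([[2.0, 1.0],
                   [0.5, 1.5]])                 # hypothetical mixing, for illustration only
X = rng.normal(size=(5000, 2)) @ mixing.T       # correlated zero-mean data

V = pca_whitening_matrix(X)
Z = (X - X.mean(axis=0)) @ V.T                  # z = V x for every sample
C_z = Z.T @ Z / Z.shape[0]                      # sample covariance of the whitened data
```

Since `V` is built from the same sample covariance that is used to evaluate `C_z`, the result equals the unit matrix up to rounding error; on new data from the same distribution the agreement would only be approximate.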
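It is easy to show that the matrix $\mathbf{V}$ of Eq. (6.33) is indeed a whitening transformation. Recalling that $\mathbf{C}_x$ can be written in terms of its eigenvector and eigenvalue matrices $\mathbf{E}$ and $\mathbf{D}$ as $\mathbf{C}_x = \mathbf{E}\mathbf{D}\mathbf{E}^T$, with $\mathbf{E}$ an orthogonal matrix satisfying $\mathbf{E}^T\mathbf{E} = \mathbf{E}\mathbf{E}^T = \mathbf{I}$, it holds:
$$\mathrm{E}\{\mathbf{z}\mathbf{z}^T\} = \mathbf{V}\,\mathrm{E}\{\mathbf{x}\mathbf{x}^T\}\,\mathbf{V}^T = \mathbf{D}^{-1/2}\mathbf{E}^T\mathbf{E}\mathbf{D}\mathbf{E}^T\mathbf{E}\mathbf{D}^{-1/2} = \mathbf{I}$$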
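The covariance of $\mathbf{z}$ is the unit matrix, hence $\mathbf{z}$ is white.
The linear operator $\mathbf{V}$ of (6.33) is by no means the only whitening matrix. It is easy to see that any matrix $\mathbf{U}\mathbf{V}$, with $\mathbf{U}$ an orthogonal matrix, is also a whitening matrix. This is because for $\mathbf{z} = \mathbf{U}\mathbf{V}\mathbf{x}$ it holds:
$$\mathrm{E}\{\mathbf{z}\mathbf{z}^T\} = \mathbf{U}\mathbf{V}\,\mathrm{E}\{\mathbf{x}\mathbf{x}^T\}\,\mathbf{V}^T\mathbf{U}^T = \mathbf{U}\mathbf{I}\mathbf{U}^T = \mathbf{I}$$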
An important instance is the matrix $\mathbf{E}\mathbf{D}^{-1/2}\mathbf{E}^T$. This is a whitening matrix because it is obtained by multiplying $\mathbf{V}$ of Eq. (6.33) from the left by the orthogonal matrix $\mathbf{E}$. This matrix is called the inverse square root of $\mathbf{C}_x$, and denoted by $\mathbf{C}_x^{-1/2}$, because it comes from the standard extension of square roots to matrices.
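The non-uniqueness of the whitening matrix can be illustrated directly on a covariance matrix. In the sketch below, the example covariance `C`, the rotation angle `theta`, and the helper `whitened_covariance` are assumptions chosen for illustration; the code only checks the identities stated above for $\mathbf{V}$, $\mathbf{U}\mathbf{V}$, and $\mathbf{E}\mathbf{D}^{-1/2}\mathbf{E}^T$.

```python
import numpy as np

C = np.array([[3.0, 1.0],
              [1.0, 2.0]])             # assumed example covariance matrix

d, E = np.linalg.eigh(C)               # C = E diag(d) E^T
V = np.diag(d ** -0.5) @ E.T           # whitening matrix of Eq. (6.33)

theta = 0.7                            # arbitrary rotation angle
U = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # an orthogonal matrix

C_inv_sqrt = E @ V                     # E D^{-1/2} E^T, the inverse square root of C

def whitened_covariance(M):
    """Covariance of z = M x when x has covariance C."""
    return M @ C @ M.T
```

All three matrices `V`, `U @ V`, and `C_inv_sqrt` give the unit matrix when passed to `whitened_covariance`. Of the three, `C_inv_sqrt` is symmetric, and multiplying it by itself and then by `C` gives the unit matrix, which is what the name inverse square root suggests.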
It is also possible to perform whitening by on-line learning rules, similar to the PCA learning rules reviewed earlier. One such direct rule is
$$\Delta\mathbf{V} = \gamma(\mathbf{I} - \mathbf{V}\mathbf{x}\mathbf{x}^T\mathbf{V}^T)\mathbf{V} = \gamma(\mathbf{I} - \mathbf{z}\mathbf{z}^T)\mathbf{V} \tag{6.34}$$
It can be seen that at a stationary point, when the change in the value of $\mathbf{V}$ is zero on the average, it holds
$$(\mathbf{I} - \mathrm{E}\{\mathbf{z}\mathbf{z}^T\})\mathbf{V} = 0$$
for which a whitened $\mathbf{z} = \mathbf{V}\mathbf{x}$ is a solution. It can be shown (see, e.g., [71]) that the algorithm will indeed converge to a whitening transformation $\mathbf{V}$.
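The on-line whitening rule can be tried out in a small simulation. The sketch below is an illustration under assumptions that are not in the original text: a constant learning rate `gamma = 0.001`, the unit matrix as the initial value of $\mathbf{V}$, gaussian inputs with a made-up covariance `C`, and one update per sample. With a constant learning rate the estimate keeps fluctuating around a whitening matrix, so the final covariance is only approximately the unit matrix.

```python
import numpy as np

def online_whitening(X, gamma=0.001):
    """Apply the update V <- V + gamma (I - z z^T) V once per sample (rows of X)."""
    n = X.shape[1]
    V = np.eye(n)                      # assumed starting point
    I = np.eye(n)
    for x in X:
        z = V @ x                      # current output for this sample
        V += gamma * (I - np.outer(z, z)) @ V
    return V

rng = np.random.default_rng(0)
C = np.array([[3.0, 1.0],
              [1.0, 2.0]])             # assumed covariance of the inputs
L = np.linalg.cholesky(C)
X = rng.normal(size=(50000, 2)) @ L.T  # zero-mean samples with covariance C

V = online_whitening(X)
C_z = V @ C @ V.T                      # covariance of z = V x under the true C
```

A decreasing learning rate would reduce the residual fluctuation; the constant value is used here only to keep the example short.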
6.5 ORTHOGONALIZATION
In some PCA and ICA algorithms, we know that in theory the solution vectors (PCA basis vectors or ICA basis vectors) are orthogonal or orthonormal, but the iterative algorithms do not always automatically produce orthogonality. Then it may be necessary to orthogonalize the vectors after each iteration step, or at some suitable intervals. In this subsection, we look into some basic orthogonalization methods.
Simply stated, the problem is as follows: given a set of $n$-dimensional linearly independent vectors $\mathbf{a}_1, \ldots, \mathbf{a}_m$, with $m \leq n$, compute another set of $m$ vectors $\mathbf{w}_1, \ldots, \mathbf{w}_m$ that are orthogonal or orthonormal (i.e., orthogonal and having unit Euclidean norm) and that span the same subspace as the original vectors. This means that each $\mathbf{w}_i$ is some linear combination of the $\mathbf{a}_j$.
The classic approach is the Gram-Schmidt orthogonalization (GSO) method [284]:
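$$\mathbf{w}_1 = \mathbf{a}_1 \tag{6.35}$$
$$\mathbf{w}_j = \mathbf{a}_j - \sum_{i=1}^{j-1} \frac{\mathbf{w}_i^T\mathbf{a}_j}{\mathbf{w}_i^T\mathbf{w}_i}\,\mathbf{w}_i \tag{6.36}$$
As a result,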
$\mathbf{w}_i^T\mathbf{w}_j = 0$ for $i \neq j$, as is easy to show by induction. Assume that the first $j-1$ basis vectors are already orthogonal; from (6.36) it then follows for any $k < j$ that $\mathbf{w}_k^T\mathbf{w}_j = \mathbf{w}_k^T\mathbf{a}_j - \sum_{i=1}^{j-1} \frac{\mathbf{w}_i^T\mathbf{a}_j}{\mathbf{w}_i^T\mathbf{w}_i}(\mathbf{w}_k^T\mathbf{w}_i)$. In the sum, all the inner products $\mathbf{w}_k^T\mathbf{w}_i$ are zero except the one where $i = k$. This term becomes equal to $\frac{\mathbf{w}_k^T\mathbf{a}_j}{\mathbf{w}_k^T\mathbf{w}_k}(\mathbf{w}_k^T\mathbf{w}_k) = \mathbf{w}_k^T\mathbf{a}_j$, and thus the inner product $\mathbf{w}_k^T\mathbf{w}_j$ is zero, too.
If in the GSO each $\mathbf{w}_j$ is further divided by its norm, the set will be orthonormal. The GSO is a sequential orthogonalization procedure. It is the basis of deflation approaches to PCA and ICA. A problem with sequential orthogonalization is the accumulation of errors.
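The Gram-Schmidt procedure of Eqs. (6.35) and (6.36) translates almost line by line into code. The following NumPy sketch is illustrative only; the function name `gram_schmidt`, the `normalize` flag, and the example matrix are assumptions. The vectors to be orthogonalized are stored as the columns of `A`, and normalization, if requested, is applied after all vectors have been orthogonalized.

```python
import numpy as np

def gram_schmidt(A, normalize=False):
    """Orthogonalize the columns of A sequentially, following Eqs. (6.35)-(6.36)."""
    A = np.asarray(A, dtype=float)
    W = np.zeros_like(A)
    for j in range(A.shape[1]):
        w = A[:, j].copy()                    # start from a_j
        for i in range(j):
            w_i = W[:, i]
            # subtract the projection of a_j onto the earlier vector w_i
            w -= (w_i @ A[:, j]) / (w_i @ w_i) * w_i
        W[:, j] = w
    if normalize:
        W = W / np.linalg.norm(W, axis=0)     # divide each column by its norm
    return W

A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 1.0]])               # three linearly independent vectors in R^4
W = gram_schmidt(A, normalize=True)
```

The first column of `W` is parallel to the first column of `A`, which shows the sequential nature of the method: the result depends on the order in which the vectors are processed.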
In symmetric orthonormalization methods, none of the original vectors $\mathbf{a}_i$ is treated differently from the others. If it is sufficient to find any orthonormal basis for the subspace spanned by the original vectors, without other constraints on the new vectors, then this problem does not have a unique solution. One solution can be obtained, for instance, by first forming the matrix $\mathbf{A} = (\mathbf{a}_1 \ldots \mathbf{a}_m)$ whose columns are the vectors to be orthogonalized, then computing $(\mathbf{A}^T\mathbf{A})^{-1/2}$ using the eigendecomposition of the symmetric matrix $(\mathbf{A}^T\mathbf{A})$, and finally putting
$$\mathbf{W} = \mathbf{A}(\mathbf{A}^T\mathbf{A})^{-1/2} \tag{6.37}$$
Obviously, for matrix
$\mathbf{W}$ it holds $\mathbf{W}^T\mathbf{W} = \mathbf{I}$, and its columns $\mathbf{w}_1, \ldots, \mathbf{w}_m$ span the same subspace as the columns of matrix $\mathbf{A}$. These vectors are thus a suitable orthonormalized basis. This solution to the symmetric orthonormalization problem is by no means unique; again, any matrix $\mathbf{W}\mathbf{U}$ with $\mathbf{U}$ an orthogonal matrix will do quite as well.
However, among these solutions, there is one specific orthogonal matrix that is closest to matrix $\mathbf{A}$ (in an appropriate matrix norm). Then this matrix is the orthogonal projection of $\mathbf{A}$ onto the set of orthogonal matrices [284]. This is somewhat analogous to the normalization of one vector $\mathbf{a}$; the vector $\mathbf{a}/\|\mathbf{a}\|$ is the projection of $\mathbf{a}$ onto the set of unit-norm vectors (the unit sphere). For matrices, it can be shown that the matrix $\mathbf{A}(\mathbf{A}^T\mathbf{A})^{-1/2}$ in Eq. (6.37) is in fact the unique orthogonal projection of $\mathbf{A}$ onto this set.
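Equation (6.37) and its projection property can be illustrated as follows. This is a sketch under assumptions that go beyond the text: the function name `symmetric_orthonormalize` and the random test matrix are made up, the Frobenius norm is taken as the "appropriate matrix norm", and the orthonormal basis from a QR decomposition serves only as one alternative solution for comparison.

```python
import numpy as np

def symmetric_orthonormalize(A):
    """Return W = A (A^T A)^{-1/2}, Eq. (6.37), for A with linearly independent columns."""
    d, E = np.linalg.eigh(A.T @ A)             # A^T A = E diag(d) E^T
    inv_sqrt = E @ np.diag(d ** -0.5) @ E.T    # (A^T A)^{-1/2}
    return A @ inv_sqrt

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))        # hypothetical vectors a_1, a_2, a_3 as columns
W = symmetric_orthonormalize(A)

# Another orthonormal basis of the same subspace, from the QR decomposition.
Q, _ = np.linalg.qr(A)

dist_W = np.linalg.norm(A - W)     # Frobenius distance from A to W
dist_Q = np.linalg.norm(A - Q)     # Frobenius distance from A to the alternative basis
```

Both `W` and `Q` have orthonormal columns spanning the same subspace, but `dist_W` is never larger than `dist_Q`, in line with the statement that (6.37) gives the orthogonal matrix closest to $\mathbf{A}$.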
This orthogonalization should be preferred in gradient algorithms that minimize a function $J(\mathbf{W})$ under the constraint $\mathbf{W}^T\mathbf{W} = \mathbf{I}$. As explained in Chapter 3, one iteration step consists of two parts: first, the matrix $\mathbf{W}$ is updated by the usual gradient descent, and second, the updated matrix is projected orthogonally onto the constraint set. For this second stage, the form given in (6.37) for orthogonalizing the updated matrix should be used.
There are iterative methods for symmetric orthonormalization that avoid the matrix eigendecomposition and inversion. An example is the following iterative algorithm [197], starting from a nonorthogonal matrix $\mathbf{W}(0)$:
$$\mathbf{W}(1) = \mathbf{W}(0)/\|\mathbf{W}(0)\|, \tag{6.38}$$
$$\mathbf{W}(t+1) = \tfrac{3}{2}\mathbf{W}(t) - \tfrac{1}{2}\mathbf{W}(t)\mathbf{W}(t)^T\mathbf{W}(t) \tag{6.39}$$
The iteration is continued until $\mathbf{W}(t)^T\mathbf{W}(t) \approx \mathbf{I}$. The convergence of this iteration can be proven as follows [197]: matrices $\mathbf{W}(t)^T\mathbf{W}(t)$ and
$$\mathbf{W}(t+1)^T\mathbf{W}(t+1) = \tfrac{9}{4}\mathbf{W}(t)^T\mathbf{W}(t) - \tfrac{3}{2}[\mathbf{W}(t)^T\mathbf{W}(t)]^2 + \tfrac{1}{4}[\mathbf{W}(t)^T\mathbf{W}(t)]^3$$
have clearly the same eigenvectors, and the relation between the eigenvalues is
$$d(t+1) = \tfrac{9}{4}d(t) - \tfrac{3}{2}d^2(t) + \tfrac{1}{4}d^3(t) \tag{6.40}$$
This nonlinear scalar iteration will converge on the interval $[0, 1]$ to 1 (see exercises). Due to the original normalization, all the eigenvalues are on this interval, assuming that the norm in the normalization is appropriately chosen (it must be a proper norm in the space of matrices; most conventional norms, except for the Frobenius norm, have this property). Because the eigenvalues tend to 1, the matrix $\mathbf{W}(t)^T\mathbf{W}(t)$ itself tends to the unit matrix.
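The iteration (6.38)-(6.39) needs only matrix products, which the following NumPy sketch makes explicit. The function names, the stopping tolerance `tol`, the iteration cap `max_iter`, and the random starting matrix are assumptions made for illustration. The spectral norm is used in the normalization step as one choice that places the eigenvalues of $\mathbf{W}(1)^T\mathbf{W}(1)$ on the required interval.

```python
import numpy as np

def iterative_orthonormalize(W0, tol=1e-10, max_iter=100):
    """Iterate Eqs. (6.38)-(6.39) until W^T W is close to the unit matrix."""
    W = W0 / np.linalg.norm(W0, 2)     # Eq. (6.38) with the spectral norm
    I = np.eye(W.shape[1])
    for _ in range(max_iter):
        if np.max(np.abs(W.T @ W - I)) < tol:
            break
        W = 1.5 * W - 0.5 * W @ W.T @ W    # Eq. (6.39)
    return W

def eigenvalue_map(d):
    """Scalar iteration of Eq. (6.40) for one eigenvalue of W^T W."""
    return 2.25 * d - 1.5 * d**2 + 0.25 * d**3

rng = np.random.default_rng(0)
W0 = rng.normal(size=(4, 3))           # hypothetical nonorthogonal starting matrix
W = iterative_orthonormalize(W0)
```

Every iterate is the starting matrix multiplied from the right by a polynomial in $\mathbf{W}(0)^T\mathbf{W}(0)$, so the limit coincides with the closed-form solution (6.37) applied to `W0`; the value 1 is a fixed point of `eigenvalue_map`.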
6.6 CONCLUDING REMARKS AND REFERENCES
Good general discussions on PCA are [14, 109, 324, 112]. The variance maximization criterion of PCA covered in Section 6.1.1 is due to Hotelling [185], while in the original work by Pearson [364], the starting point was minimizing the squared reconstruction error (Section 6.1.2). These are not the only criteria leading to the PCA solution; yet another information-theoretic approach is maximization of mutual information between the inputs and outputs in a linear gaussian channel [112]. An expansion closely related to PCA is the Karhunen-Loève expansion for continuous second-order stochastic processes, whose autocovariance function can be expanded in terms of its eigenvalues and orthonormal eigenfunctions in a convergent series [237, 283].
The on-line algorithms of Section 6.2 are especially suitable for neural network implementations. In numerical analysis and signal processing, many other adaptive algorithms of varying complexity have been reported for different computing hardware. A good review is given by Comon and Golub [92]. Experimental results on PCA algorithms both for finding the eigenvectors of stationary training sets, and for tracking the slowly changing eigenvectors of nonstationary input data streams, have been reported in [324, 391, 350]. An obvious extension of PCA neural networks would be to use nonlinear units, e.g., perceptrons, instead of the linear units. It turns out that such “nonlinear PCA” networks will in some cases give the independent components of the input vectors, instead of just uncorrelated components [232, 233] (see Chapter 12).
Good general texts on factor analysis are [166, 243, 454]. The principal FA model has been recently discussed by [421] and [387].
Problems
6.1 Consider the problem of maximizing the variance of $y_m = \mathbf{w}_m^T\mathbf{x}$ $(m = 1, \ldots, n)$ under the constraint that $\mathbf{w}_m$ must be of unit Euclidean norm and orthogonal to all the previously found principal vectors $\mathbf{w}_i$, $i < m$. Show that the solution is given by $\mathbf{w}_m = \mathbf{e}_m$ with $\mathbf{e}_m$ the eigenvector of $\mathbf{C}_x$ corresponding to the $m$th largest eigenvalue.
6.2 Show that the criterion (6.9) is equivalent to the mean-square error (6.7). Show that at the optimum, if $\mathbf{w}_i = \mathbf{e}_i$, the value of (6.7) is given by (6.10).
6.3 Given the data model (6.14), show that the covariance matrix has the form (6.15).
6.4 The learning rule for a PCA neuron is based on maximization of $y^2 = (\mathbf{w}^T\mathbf{x})^2$ under constraint $\|\mathbf{w}\| = 1$. (We have now omitted the subscript 1 because only one neuron is involved.)
6.4.1. Show that an unlimited gradient ascent method would compute the new vector $\mathbf{w}$ from
$$\mathbf{w} \leftarrow \mathbf{w} + \gamma(\mathbf{w}^T\mathbf{x})\mathbf{x}$$
with $\gamma$ the learning rate. Show that the norm of the weight vector always grows in this case.
6.4.2. Thus the norm must be bounded. A possibility is the following update rule:
$$\mathbf{w} \leftarrow [\mathbf{w} + \gamma(\mathbf{w}^T\mathbf{x})\mathbf{x}] / \|\mathbf{w} + \gamma(\mathbf{w}^T\mathbf{x})\mathbf{x}\|$$
Now the norm will stay equal to 1. Derive an approximation to this update rule for a small value of $\gamma$, by taking a Taylor expansion of the right-hand side with respect to $\gamma$ and dropping all higher powers of $\gamma$. Leave only terms linear in $\gamma$. Show that the result is
$$\mathbf{w} \leftarrow \mathbf{w} + \gamma[(\mathbf{w}^T\mathbf{x})\mathbf{x} - (\mathbf{w}^T\mathbf{x})^2\mathbf{w}]$$
which is the basic PCA learning rule of Eq. (6.16).
6.4.3. Take averages with respect to the random input vector $\mathbf{x}$ and show that in a stationary point of the iteration, where there is no change on the average in the value of $\mathbf{w}$, it holds: $\mathbf{C}_x\mathbf{w} = (\mathbf{w}^T\mathbf{C}_x\mathbf{w})\mathbf{w}$ with $\mathbf{C}_x = \mathrm{E}\{\mathbf{x}\mathbf{x}^T\}$.
6.4.4. Show that the only possible solutions will be the eigenvectors of $\mathbf{C}_x$.
6.5 The covariance matrix of vector $\mathbf{x}$ is
$$\mathbf{C}_x = \begin{pmatrix} 2.5 & 1.5 \\ 1.5 & 2.5 \end{pmatrix} \tag{6.41}$$
Compute a whitening transformation for $\mathbf{x}$.
6.6 * Based on the first step (6.38) of the orthogonalization algorithm, show that $0 < d(1) < 1$ where $d(1)$ is any eigenvalue of $\mathbf{W}(1)$. Next consider iteration (6.40). Write