5.3 Descent methods and conjugate gradients
5.3.4 Convergence of CG
The convergence theory for CG is related to the fact that $u_k$ minimizes $\phi(u)$ over the Krylov space $\mathcal{K}_k$ defined in the previous section. We now show that the $A$-norm of the error $e_k$ is also minimized over all possible choices of vectors $u$ in $\mathcal{K}_k$. The $A$-norm is defined by
$$\|e\|_A = \sqrt{e^T A e}. \tag{5.41}$$
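This satisfies the requirements of a vector norm provided that $A$ is SPD, which we are assuming in studying the CG method. It is a natural norm to use because, with $e = u - u^*$,
$$\begin{aligned}
\|e\|_A^2 &= (u-u^*)^T A (u-u^*) \\
&= u^T A u - 2u^T A u^* + u^{*T} A u^* \\
&= 2\phi(u) + u^{*T} A u^*,
\end{aligned} \tag{5.42}$$
where the last step uses $Au^* = f$ and $\phi(u) = \frac{1}{2}u^T A u - u^T f$. Since $u^{*T} A u^*$ is a fixed number, we see that minimizing $\|e\|_A$ is equivalent to minimizing $\phi(u)$.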
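The identity (5.42) can be checked numerically. The following is a minimal sketch, assuming a hypothetical 2 × 2 SPD matrix, exact solution, and trial vector chosen purely for illustration, and assuming $\phi(u) = \frac{1}{2}u^TAu - u^Tf$. Plain Python lists are used to keep the example dependency-free.

```python
# Numerical check of (5.42) on a small SPD system (all values are illustrative).
A = [[4.0, 1.0],
     [1.0, 3.0]]          # symmetric positive definite
u_star = [1.0, 2.0]       # chosen exact solution u*
u = [0.5, -1.0]           # arbitrary trial vector

def matvec(M, x):
    """Dense matrix-vector product M x."""
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in M]

def dot(x, y):
    """Euclidean inner product x^T y."""
    return sum(xi * yi for xi, yi in zip(x, y))

f = matvec(A, u_star)     # right-hand side f = A u*

def phi(v):
    """phi(v) = 1/2 v^T A v - v^T f."""
    return 0.5 * dot(v, matvec(A, v)) - dot(v, f)

e = [ui - si for ui, si in zip(u, u_star)]
lhs = dot(e, matvec(A, e))             # ||e||_A^2
rhs = 2.0 * phi(u) + dot(u_star, f)    # 2 phi(u) + u*^T A u*
print(lhs, rhs)
```

The two printed numbers should agree to rounding error, whatever SPD matrix and vectors are substituted.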
Since
$$u_k = u_0 + \alpha_0 p_0 + \alpha_1 p_1 + \cdots + \alpha_{k-1} p_{k-1},$$
we find by subtracting $u^*$ that
$$e_k = e_0 + \alpha_0 p_0 + \alpha_1 p_1 + \cdots + \alpha_{k-1} p_{k-1}.$$
Hence $e_k - e_0$ is in $\mathcal{K}_k$ and by Theorem 5.3.1 lies in $\operatorname{span}(f, Af, \ldots, A^{k-1}f)$. Since $u_0 = 0$ we have $f = Au^* = -Ae_0$ and so $e_k - e_0$ also lies in $\operatorname{span}(Ae_0, A^2e_0, \ldots, A^ke_0)$. So $e_k = e_0 + c_1Ae_0 + c_2A^2e_0 + \cdots + c_kA^ke_0$ for some coefficients $c_1, \ldots, c_k$. In other words,
$$e_k = P_k(A)e_0 \tag{5.43}$$
where
$$P_k(A) = I + c_1A + c_2A^2 + \cdots + c_kA^k \tag{5.44}$$
is a polynomial in $A$. For a scalar value $x$ we have
$$P_k(x) = 1 + c_1x + c_2x^2 + \cdots + c_kx^k \tag{5.45}$$
and $P_k \in \mathcal{P}_k$ where
$$\mathcal{P}_k = \{\text{polynomials } P(x) \text{ of degree at most } k \text{ satisfying } P(0) = 1\}. \tag{5.46}$$
The polynomial $P_k$ constructed implicitly by the CG algorithm solves the minimization problem
$$\min_{P \in \mathcal{P}_k} \|P(A)e_0\|_A. \tag{5.47}$$
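In order to understand how a polynomial function of a matrix behaves, recall that
$$A = R\Lambda R^{-1} \implies A^j = R\Lambda^j R^{-1}$$
and so
$$P_k(A) = R\,P_k(\Lambda)\,R^{-1},$$
where
$$P_k(\Lambda) = \begin{bmatrix} P_k(\lambda_1) & & & \\ & P_k(\lambda_2) & & \\ & & \ddots & \\ & & & P_k(\lambda_m) \end{bmatrix}.$$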
Note, in particular, that if $P_k(x)$ has a root at each eigenvalue $\lambda_1, \ldots, \lambda_m$ then $P_k(\Lambda)$ is the zero matrix and so $e_k = P_k(A)e_0 = 0$. If $A$ has $n$ distinct eigenvalues $\lambda_1, \ldots, \lambda_n$ then there is a polynomial $P_n \in \mathcal{P}_n$ that has these roots and hence the CG algorithm converges in at most $n$ iterations, as was previously claimed. The polynomial that CG automatically constructs is simply $P_n(x) = (1 - x/\lambda_1)\cdots(1 - x/\lambda_n)$.
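The finite-termination property can be observed numerically. The sketch below is an illustration under stated assumptions, not the implementation from the previous section: it uses a hypothetical diagonal matrix (CG depends only on the eigenvalues and the eigencomponents of the data, so nothing is lost by working in the eigenbasis), the starting guess $u_0 = 0$, and plain Python lists. The helper name `cg_residual_history` is introduced here for illustration only.

```python
import math

def cg_residual_history(diag, f, num_iters):
    """Run CG on A u = f with A = diag(diag) and u0 = 0.

    Returns the list [||r_0||, ||r_1||, ..., ||r_num_iters||] of residual 2-norms.
    """
    u = [0.0] * len(f)
    r = list(f)                      # r0 = f - A u0 = f
    p = list(r)                      # first search direction
    rr = sum(ri * ri for ri in r)
    history = [math.sqrt(rr)]
    for _ in range(num_iters):
        w = [d * pi for d, pi in zip(diag, p)]              # w = A p
        alpha = rr / sum(pi * wi for pi, wi in zip(p, w))
        u = [ui + alpha * pi for ui, pi in zip(u, p)]
        r = [ri - alpha * wi for ri, wi in zip(r, w)]
        rr_new = sum(ri * ri for ri in r)
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
        history.append(math.sqrt(rr))
    return history

# m = 6 unknowns, but only n = 3 distinct eigenvalues (illustrative values).
eigenvalues = [1.0, 1.0, 4.0, 4.0, 4.0, 9.0]
f = [1.0] * 6
history = cg_residual_history(eigenvalues, f, num_iters=3)
for k, res in enumerate(history):
    print(k, res)
```

With three distinct eigenvalues the residual should still be of order one after two iterations and drop to rounding level after the third.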
To get an idea of how small $\|e_k\|_A$ will be at some earlier point in the iteration, we will show that for any polynomial $P(x)$ we have
$$\frac{\|P(A)e_0\|_A}{\|e_0\|_A} \le \max_{1\le j\le m} |P(\lambda_j)| \tag{5.48}$$
and then exhibit one polynomial $\tilde P_k \in \mathcal{P}_k$ for which we can use this to obtain a useful upper bound on $\|e_k\|_A/\|e_0\|_A$.
Since $A$ is SPD, $A$ has an orthonormal set of eigenvectors $r_j$, $j = 1, 2, \ldots, m$, and any vector $e_0$ can be written as
$$e_0 = \sum_{j=1}^m a_j r_j$$
for some coefficients $a_1, \ldots, a_m$. Since the $r_j$ are orthonormal, we find that
$$\|e_0\|_A^2 = \sum_{j=1}^m a_j^2 \lambda_j. \tag{5.49}$$
This may be easiest to see in matrix form: $e_0 = Ra$ where $R$ is the matrix of eigenvectors and $a$ the vector of coefficients. Since $R^T = R^{-1}$, we have
$$\|e_0\|_A^2 = e_0^T A e_0 = a^T R^T A R a = a^T \Lambda a,$$
which gives (5.49).
If $P(A)$ is any polynomial in $A$ then
$$P(A)e_0 = \sum_{j=1}^m a_j P(\lambda_j) r_j$$
and so
$$\|P(A)e_0\|_A^2 = \sum_{j=1}^m a_j^2 P(\lambda_j)^2 \lambda_j \le \max_{1\le j\le m} (P(\lambda_j))^2 \sum_{j=1}^m a_j^2 \lambda_j. \tag{5.50}$$
Combining this with (5.49) gives (5.48).

Figure 5.7: (a) The polynomial $\tilde P_1(x)$ based on a sample set of eigenvalues marked by dots on the x-axis. (b) The polynomial $\tilde P_2(x)$ for the same set of eigenvalues. See Figure 5.8(a) for the polynomial $\tilde P_5(x)$.
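The bound (5.48) can be checked directly on a small example. This is a minimal sketch, assuming a hypothetical 2 × 2 SPD matrix with eigenvalues 1 and 3, a hypothetical test polynomial $P(x) = 1 - x/3$, and an arbitrary $e_0$; none of these values come from the text.

```python
import math

A = [[2.0, 1.0],
     [1.0, 2.0]]              # SPD with eigenvalues 1 and 3
eigenvalues = [1.0, 3.0]

def matvec(M, x):
    """Dense matrix-vector product M x."""
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in M]

def a_norm(x):
    """A-norm sqrt(x^T A x) as in (5.41)."""
    return math.sqrt(sum(xi * yi for xi, yi in zip(x, matvec(A, x))))

def P(x):
    """Hypothetical test polynomial with P(0) = 1."""
    return 1.0 - x / 3.0

def apply_P(v):
    """Apply the matrix polynomial P(A) = I - A/3 to a vector."""
    Av = matvec(A, v)
    return [vi - avi / 3.0 for vi, avi in zip(v, Av)]

e0 = [1.0, 0.0]               # arbitrary initial error
ratio = a_norm(apply_P(e0)) / a_norm(e0)              # left side of (5.48)
bound = max(abs(P(lam)) for lam in eigenvalues)       # right side of (5.48)
print(ratio, bound)
```

The ratio should not exceed the bound for any choice of `e0`; here it is strictly smaller because $P$ vanishes at one of the two eigenvalues.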
We will now show that for a particular choice of polynomials $\tilde P_k \in \mathcal{P}_k$ we can evaluate the right-hand side of (5.48) and obtain a bound that decreases with increasing $k$. Since the polynomial $P_k$ constructed by CG solves the problem (5.47), we know that
$$\|P_k(A)e_0\|_A \le \|\tilde P_k(A)e_0\|_A,$$
and so this will give a bound for the convergence rate of the CG algorithm.
Consider the case $k = 1$, after one step of CG. We choose the linear function
$$\tilde P_1(x) = 1 - \frac{2x}{\lambda_m + \lambda_1}, \tag{5.51}$$
where we assume the eigenvalues are ordered $0 < \lambda_1 \le \lambda_2 \le \cdots \le \lambda_m$. A typical case is shown in Figure 5.7(a). The linear function $\tilde P_1(x) = 1 + c_1x$ must satisfy $\tilde P_1(0) = 1$, and the slope $c_1$ has been chosen so that
$$\tilde P_1(\lambda_1) = -\tilde P_1(\lambda_m),$$
which gives
$$1 + c_1\lambda_1 = -1 - c_1\lambda_m \implies c_1 = -\frac{2}{\lambda_m + \lambda_1}.$$
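If the slope were made any larger or smaller in magnitude then the value of $|\tilde P_1(\lambda)|$ would increase at either $\lambda_m$ or $\lambda_1$, respectively; see Figure 5.7(a). For this polynomial we have
$$\max_{1\le j\le m} |\tilde P_1(\lambda_j)| = \tilde P_1(\lambda_1) = 1 - \frac{2\lambda_1}{\lambda_m + \lambda_1} = \frac{\lambda_m/\lambda_1 - 1}{\lambda_m/\lambda_1 + 1} = \frac{\kappa - 1}{\kappa + 1} \tag{5.52}$$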
where $\kappa = \kappa_2(A)$ is the condition number of $A$. This gives an upper bound on the reduction of the error in the first step of the CG algorithm and is the best estimate we can obtain knowing only the distribution of eigenvalues of $A$. The CG algorithm constructs the actual $P_1(x)$ based on $e_0$ as well as $A$ and may do better than this for certain initial data. For example, if $e_0 = a_jr_j$ has only a single eigencomponent then $P_1(x) = 1 - x/\lambda_j$ reduces the error to zero in one step. This is the case where the initial guess lies on an axis of the ellipsoid and the residual points directly to $u^*$. But the above bound is the best we can obtain that holds for any $e_0$.
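Both statements, the one-step bound (5.52) and the exact convergence for a single eigencomponent, can be illustrated in the eigenbasis. The sketch below assumes $u_0 = 0$ (so the first search direction is $r_0 = -Ae_0$), a hypothetical set of eigenvalues, and hypothetical coefficient vectors; the helper name `one_step_reduction` is introduced here for illustration.

```python
import math

def one_step_reduction(eigenvalues, coeffs):
    """Return ||e_1||_A / ||e_0||_A after one CG step, working in the eigenbasis.

    eigenvalues: eigenvalues lambda_j of A.
    coeffs: coefficients a_j of e_0 in the orthonormal eigenvector basis.
    """
    # First direction p0 = r0 = -A e0 has eigencomponents -lambda_j * a_j.
    r = [-lam * a for lam, a in zip(eigenvalues, coeffs)]
    rr = sum(x * x for x in r)
    rAr = sum(lam * x * x for lam, x in zip(eigenvalues, r))
    alpha = rr / rAr
    e1 = [a + alpha * x for a, x in zip(coeffs, r)]     # e1 = e0 + alpha0 * p0

    def a_norm(e):
        # In the eigenbasis ||e||_A^2 = sum_j a_j^2 lambda_j, as in (5.49).
        return math.sqrt(sum(lam * x * x for lam, x in zip(eigenvalues, e)))

    return a_norm(e1) / a_norm(coeffs)

eigenvalues = [1.0, 2.0, 5.0, 10.0]                     # illustrative values
kappa = eigenvalues[-1] / eigenvalues[0]
bound = (kappa - 1.0) / (kappa + 1.0)                   # right side of (5.52)
generic = one_step_reduction(eigenvalues, [1.0, 1.0, 1.0, 1.0])
single = one_step_reduction(eigenvalues, [0.0, 0.0, 1.0, 0.0])
print(bound, generic, single)
```

The generic reduction should fall below the bound, and the single-eigencomponent case should give zero up to rounding.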
Now consider the case $k = 2$, after two iterations of CG. Figure 5.7(b) shows the quadratic function $\tilde P_2(x)$ that has been chosen so that
$$\tilde P_2(\lambda_1) = -\tilde P_2\big((\lambda_m + \lambda_1)/2\big) = \tilde P_2(\lambda_m).$$
This function equioscillates at three points in the interval $[\lambda_1, \lambda_m]$ where the maximum amplitude is taken. This is the polynomial from $\mathcal{P}_2$ that has the smallest maximum value on this interval, i.e., it minimizes
$$\max_{\lambda_1 \le x \le \lambda_m} |P(x)|.$$
This polynomial does not necessarily solve the problem of minimizing
$$\max_{1\le j\le m} |P(\lambda_j)|$$
unless $(\lambda_1 + \lambda_m)/2$ happens to be an eigenvalue, since we could possibly reduce this quantity by choosing a quadratic with a slightly larger magnitude near the midpoint of the interval but a smaller magnitude at each eigenvalue. However, it has the great virtue of being easy to compute based only on $\lambda_1$ and $\lambda_m$. Moreover we can compute the analogous polynomial $\tilde P_k(x)$ for arbitrary degree $k$, the polynomial from $\mathcal{P}_k$ with the property of minimizing the maximum amplitude over the entire interval $[\lambda_1, \lambda_m]$. The resulting maximum amplitude can also be computed in terms of $\lambda_1$ and $\lambda_m$, and in fact depends only on the ratio of these and hence depends only on the condition number of $A$. This gives an upper bound for the convergence rate of CG in terms of the condition number of $A$ that is often quite realistic.
The polynomials we want are simply shifted and scaled versions of the Chebyshev polynomials discussed in Section 4.2.5. Recall that $T_k(x)$ equioscillates on the interval $[-1, 1]$ with the extreme values $\pm 1$ being taken at $k + 1$ points, including the endpoints. We shift this to the interval $[\lambda_1, \lambda_m]$ and scale it so that the value at $x = 0$ is 1, and obtain
$$\tilde P_k(x) = \frac{T_k\!\left(\dfrac{\lambda_m + \lambda_1 - 2x}{\lambda_m - \lambda_1}\right)}{T_k\!\left(\dfrac{\lambda_m + \lambda_1}{\lambda_m - \lambda_1}\right)}. \tag{5.53}$$
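For $k = 1$ this gives (5.51) since $T_1(x) = x$. We now only need to compute
$$\max_{1\le j\le m} |\tilde P_k(\lambda_j)| = \tilde P_k(\lambda_1)$$
in order to obtain the desired bound on $\|e_k\|_A$. We have
$$\tilde P_k(\lambda_1) = \frac{T_k(1)}{T_k\!\left(\dfrac{\lambda_m + \lambda_1}{\lambda_m - \lambda_1}\right)} = \frac{1}{T_k\!\left(\dfrac{\lambda_m + \lambda_1}{\lambda_m - \lambda_1}\right)}. \tag{5.54}$$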
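The properties of $\tilde P_k$ used above can be checked numerically. The sketch below evaluates $T_k$ with the standard three-term recurrence $T_{j+1}(x) = 2xT_j(x) - T_{j-1}(x)$, which is valid for any real $x$ (a swapped-in evaluation technique, not the formula used in the text), and assumes illustrative values $\lambda_1 = 1$, $\lambda_m = 10$, $k = 5$. The function names are hypothetical.

```python
def chebyshev_T(k, x):
    """Evaluate T_k(x) by the three-term recurrence, valid for any real x."""
    t_prev, t = 1.0, x            # T_0 and T_1
    if k == 0:
        return t_prev
    for _ in range(k - 1):
        t_prev, t = t, 2.0 * x * t - t_prev
    return t

def p_tilde(k, x, lam_min, lam_max):
    """Shifted, scaled Chebyshev polynomial (5.53), normalized so p_tilde(k, 0) = 1."""
    num = chebyshev_T(k, (lam_max + lam_min - 2.0 * x) / (lam_max - lam_min))
    den = chebyshev_T(k, (lam_max + lam_min) / (lam_max - lam_min))
    return num / den

lam_min, lam_max, k = 1.0, 10.0, 5      # illustrative values
# Sample the interval [lam_min, lam_max] on a uniform grid.
grid = [lam_min + (lam_max - lam_min) * i / 1000 for i in range(1001)]
max_on_interval = max(abs(p_tilde(k, x, lam_min, lam_max)) for x in grid)
value_at_zero = p_tilde(k, 0.0, lam_min, lam_max)
value_at_lam_min = p_tilde(k, lam_min, lam_min, lam_max)
print(value_at_zero, max_on_interval, value_at_lam_min)
```

The value at zero should be 1, and the sampled maximum over the interval should coincide with the value at $\lambda_1$, as (5.54) asserts.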
Note that
$$\frac{\lambda_m + \lambda_1}{\lambda_m - \lambda_1} = \frac{\lambda_m/\lambda_1 + 1}{\lambda_m/\lambda_1 - 1} = \frac{\kappa + 1}{\kappa - 1} > 1$$
so we need to evaluate the Chebyshev polynomial at a point outside the interval $[-1, 1]$. Recall the formula (4.17) for the Chebyshev polynomial that is valid for $x$ in the interval $[-1, 1]$. Outside this interval there is an analogous formula in terms of the hyperbolic cosine,
$$T_k(x) = \cosh(k \cosh^{-1} x).$$
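We have
$$\cosh(z) = \frac{e^z + e^{-z}}{2} = \frac{1}{2}(y + y^{-1})$$
where $y = e^z$, so if we make the change of variables $x = \frac{1}{2}(y + y^{-1})$ then $\cosh^{-1} x = z$ and
$$T_k(x) = \cosh(kz) = \frac{e^{kz} + e^{-kz}}{2} = \frac{1}{2}(y^k + y^{-k}).$$
We can find $y$ from any given $x$ by solving the quadratic equation $y^2 - 2xy + 1 = 0$, yielding
$$y = x \pm \sqrt{x^2 - 1}.$$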
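To evaluate (5.54) we need to evaluate $T_k$ at $x = (\kappa + 1)/(\kappa - 1)$, where we obtain
$$\begin{aligned}
y &= \frac{\kappa + 1}{\kappa - 1} \pm \sqrt{\left(\frac{\kappa + 1}{\kappa - 1}\right)^2 - 1} \\
&= \frac{\kappa + 1 \pm \sqrt{4\kappa}}{\kappa - 1} \\
&= \frac{(\sqrt{\kappa} \pm 1)^2}{(\sqrt{\kappa} + 1)(\sqrt{\kappa} - 1)} \\
&= \frac{\sqrt{\kappa} + 1}{\sqrt{\kappa} - 1} \quad\text{or}\quad \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}.
\end{aligned} \tag{5.55}$$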
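Either choice of $y$ gives the same value for
$$T_k\!\left(\frac{\kappa + 1}{\kappa - 1}\right) = \frac{1}{2}\left[\left(\frac{\sqrt{\kappa} + 1}{\sqrt{\kappa} - 1}\right)^k + \left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^k\right]. \tag{5.56}$$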
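As a sanity check on (5.56), the closed form can be compared with the hyperbolic-cosine formula for $T_k$. This is a minimal sketch using Python's standard `math.cosh` and `math.acosh`; the values $\kappa = 10$ and $k = 5$ and the function names are chosen for illustration only.

```python
import math

def T_cosh(k, x):
    """T_k(x) for x >= 1 via the hyperbolic-cosine formula cosh(k * acosh(x))."""
    return math.cosh(k * math.acosh(x))

def T_closed_form(k, kappa):
    """Right-hand side of (5.56), written in terms of sqrt(kappa)."""
    s = math.sqrt(kappa)
    y = (s + 1.0) / (s - 1.0)
    return 0.5 * (y ** k + y ** (-k))

kappa, k = 10.0, 5                      # illustrative values
x = (kappa + 1.0) / (kappa - 1.0)       # point at which T_k is evaluated
print(T_cosh(k, x), T_closed_form(k, kappa))
```

The two values should agree to rounding error for any $\kappa > 1$ and any $k \ge 0$.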
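Using this in (5.54) and combining with (5.48) gives
$$\frac{\|e_k\|_A}{\|e_0\|_A} \le \frac{\|\tilde P_k(A)e_0\|_A}{\|e_0\|_A} \le 2\left[\left(\frac{\sqrt{\kappa} + 1}{\sqrt{\kappa} - 1}\right)^k + \left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^k\right]^{-1} \le 2\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^k. \tag{5.57}$$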
This gives an upper bound on the error when the CG algorithm is used. In practice the error may be smaller, either because the initial error $e_0$ happens to be deficient in some eigencoefficients, or more likely because the optimal polynomial $P_k(x)$ is much smaller at all the eigenvalues $\lambda_j$ than our choice $\tilde P_k(x)$ used to obtain the above bound. This typically happens if the eigenvalues of $A$ are clustered near fewer than $m$ points. Then the $P_k(x)$ constructed by CG will be smaller near these points and larger on other parts of the interval $[\lambda_1, \lambda_m]$ where no eigenvalues lie.
Figure 5.8 shows some examples for the case $k = 5$. In Figure 5.8(a) the same eigenvalue distribution as in Figure 5.7 is assumed, and the shifted Chebyshev polynomial $\tilde P_5(x)$ is plotted. This gives an upper bound $\|e_5\|_A/\|e_0\|_A \le 0.0756$ for a matrix $A$ with these eigenvalues, which has condition number $\kappa = 10$.
Figure 5.8(b) shows a different eigenvalue distribution, for a matrix $A$ that is better conditioned, with $\lambda_1 = 2$ and $\lambda_m = 10$ so $\kappa = 5$. In this case $\|e_5\|_A/\|e_0\|_A \le 0.0163$.
Figure 5.8(c) shows the situation for a matrix that is more poorly conditioned, with $\lambda_1 = 0.2$ and $\lambda_m = 10$ so $\kappa = 50$. Using the Chebyshev polynomial $\tilde P_5(x)$ shown in this figure gives an upper bound of $\|e_5\|_A/\|e_0\|_A \le 0.4553$. For a matrix $A$ with this condition number but with many eigenvalues scattered more or less uniformly throughout the interval $[0.2, 10]$, this would be a realistic estimate of the reduction in error after 5 steps of CG. For the eigenvalue distribution shown in the figure, however, CG is in fact able to do much better and constructs a polynomial $P_5(x)$ that might look more like the one shown in Figure 5.8(d), which is small near each of the three clusters of eigenvalues but huge in between.

Figure 5.8: (a) The polynomial $\tilde P_5(x)$ based on a sample set of eigenvalues marked by dots on the x-axis, the same set as in Figure 5.7. (b) The polynomial $\tilde P_5(x)$ for a matrix with smaller $\kappa$. (c) The polynomial $\tilde P_5(x)$ for a matrix with larger $\kappa$. (d) A better polynomial $P(x)$ of degree 5 for the same eigenvalue distribution as in figure (c).
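The three bounds quoted for $k = 5$ follow from the first inequality in (5.57). The sketch below evaluates both right-hand sides of (5.57) for $\kappa = 5$, $10$, and $50$; the function names are hypothetical, and the printed values should match the quoted figures up to rounding in the last digit.

```python
import math

def cg_bound(kappa, k):
    """Sharper right-hand side of (5.57): 2 / (y^k + y^-k), y = (sqrt(kappa)+1)/(sqrt(kappa)-1)."""
    s = math.sqrt(kappa)
    y = (s + 1.0) / (s - 1.0)
    return 2.0 / (y ** k + y ** (-k))

def cg_bound_simple(kappa, k):
    """Simpler right-hand side of (5.57): 2 * ((sqrt(kappa)-1)/(sqrt(kappa)+1))^k."""
    s = math.sqrt(kappa)
    return 2.0 * ((s - 1.0) / (s + 1.0)) ** k

for kappa in (5.0, 10.0, 50.0):
    print(kappa, round(cg_bound(kappa, 5), 4), round(cg_bound_simple(kappa, 5), 4))
```

The simpler bound is always at least as large as the sharper one, with the gap shrinking as $k$ grows.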
The bound (5.57) is only an upper bound and may be pessimistic when the eigenvalues are clustered, which sometimes happens in practice. As an iterative method it is really the number of clusters, not the number of mathematically distinct eigenvalues, that then determines how rapidly CG converges in practical terms.
The bound (5.57) is realistic for many matrices, however, and shows that in general the convergence rate depends on the size of the condition number $\kappa$. If $\kappa$ is large then
$$2\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^k \approx 2\left(1 - \frac{2}{\sqrt{\kappa}}\right)^k \approx 2e^{-2k/\sqrt{\kappa}}, \tag{5.58}$$
and we expect that the number of iterations required to reach a desired tolerance will be $k = O(\sqrt{\kappa})$.
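To make the $O(\sqrt{\kappa})$ estimate concrete, one can solve $2e^{-2k/\sqrt{\kappa}} \le \epsilon$ for $k$, giving $k \ge \frac{1}{2}\sqrt{\kappa}\,\ln(2/\epsilon)$. The sketch below compares this with the smallest $k$ satisfying the simpler bound in (5.57); the tolerance $\epsilon = 10^{-6}$ and the function names are assumptions made for illustration.

```python
import math

def iterations_estimate(kappa, tol):
    """Smallest integer k with 2*exp(-2k/sqrt(kappa)) <= tol, from the approximation (5.58)."""
    return math.ceil(0.5 * math.sqrt(kappa) * math.log(2.0 / tol))

def iterations_from_bound(kappa, tol):
    """Smallest integer k with 2*((sqrt(kappa)-1)/(sqrt(kappa)+1))**k <= tol, from (5.57)."""
    s = math.sqrt(kappa)
    rho = (s - 1.0) / (s + 1.0)
    return math.ceil(math.log(tol / 2.0) / math.log(rho))

tol = 1e-6      # hypothetical target reduction in the A-norm of the error
for kappa in (1e2, 1e4, 1e6):
    print(kappa, iterations_estimate(kappa, tol), iterations_from_bound(kappa, tol))
```

Multiplying $\kappa$ by 100 should multiply both counts by roughly 10, which is the square-root dependence described above.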
For example, the standard second-order discretization of the Poisson problem on a grid with $m$ points in each direction gives a matrix with $\kappa = O(1/h^2)$, where $h = 1/(m+1)$. The bound (5.58) suggests that CG will require $O(m)$ iterations to converge, which is observed in practice. This is true in any number of space dimensions. In one dimension, where there are only $m$ unknowns, this does not look very good (and of course it's best to just solve the tridiagonal system by elimination). In two dimensions there are $m^2$ unknowns and $O(m^2)$ work per iteration is required to compute $Ap_{k-1}$, so CG requires $O(m^3)$ work to converge to a fixed tolerance, which is significantly better than Gaussian elimination and comparable to SOR with the optimal $\omega$. Of course for this problem a fast Poisson solver could be used, requiring only $O(m^2 \log m)$ work. But for other problems, such as variable coefficient elliptic equations, CG may still work very well while SOR only works well if the optimal $\omega$ is found, which may be impossible, and FFT methods are inapplicable. Similar comments apply in three dimensions.
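The $O(m)$ growth in iteration count can be observed with a small experiment. The following is a rough sketch under several assumptions not taken from the text: the 5-point stencil is applied matrix-free on an $m \times m$ grid with zero Dirichlet boundary values, the right-hand side is an arbitrary seeded pseudo-random vector, $u_0 = 0$, and the stopping test uses the residual 2-norm rather than the $A$-norm of the error. The iterate itself is not stored because only the count is needed.

```python
import random

def apply_laplacian(u, m):
    """Apply the 5-point stencil (4 on the diagonal, -1 per neighbor) on an m x m grid.

    This is h^2 times the discrete negative Laplacian with zero Dirichlet data;
    the scaling by h^2 does not change the CG iteration count.
    """
    w = [0.0] * (m * m)
    for i in range(m):
        for j in range(m):
            idx = i * m + j
            s = 4.0 * u[idx]
            if i > 0:
                s -= u[idx - m]
            if i < m - 1:
                s -= u[idx + m]
            if j > 0:
                s -= u[idx - 1]
            if j < m - 1:
                s -= u[idx + 1]
            w[idx] = s
    return w

def cg_iterations(m, tol=1e-8, seed=0):
    """Count CG iterations until ||r_k||_2 <= tol * ||r_0||_2, starting from u0 = 0."""
    rng = random.Random(seed)
    f = [rng.uniform(-1.0, 1.0) for _ in range(m * m)]   # arbitrary right-hand side
    r = list(f)                                          # r0 = f since u0 = 0
    p = list(r)
    rr = sum(x * x for x in r)
    target = tol * tol * rr
    for k in range(1, m * m + 1):
        w = apply_laplacian(p, m)                        # w = A p
        alpha = rr / sum(a * b for a, b in zip(p, w))
        r = [a - alpha * b for a, b in zip(r, w)]
        rr_new = sum(x * x for x in r)
        if rr_new <= target:
            return k
        beta = rr_new / rr
        p = [a + beta * b for a, b in zip(r, p)]
        rr = rr_new
    return m * m                                         # cap reached without converging

counts = {m: cg_iterations(m) for m in (8, 16, 32)}
print(counts)
```

Doubling $m$ should roughly double the iteration count, in line with the $O(m)$ estimate, although the exact counts depend on the right-hand side and the tolerance.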