4.1.1 Newton’s method
We start by recalling Newton’s method for unconstrained minimization. Newton’s method is an iterative method for finding roots of equations in one or more dimensions. It is one of the most important algorithms in numerical analysis and scientific computing. In convex optimization it can be used to find minimizers of convex differentiable functions. The Newton method is also the fundamental algorithm for the design of fast interior point algorithms.
Unconstrained minimization
Newton’s method is quite general. It is natural to define it in the setting of Banach spaces. Chapter XVIII of the book “Functional analysis in normed spaces” by L.V. Kantorovich and G.P. Akilov is a classical resource for this which also includes the first thorough analysis of the convergence behavior of Newton’s method. Nowadays every comprehensive book on numerical analysis contains a chapter stating explicit conditions for the convergence speed of Newton’s method.
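To keep it as simple and concrete as possible we define it here only for $\mathbb{R}^n$. Let $\Omega$ be an open set of $\mathbb{R}^n$ and let $f : \Omega \to \mathbb{R}$ be a strictly convex, differentiable function. The Taylor approximation of the function $f$ around the point $a$ is
\[
f(a + x) = \left( f(a) + \nabla f(a)^{\mathsf T} x + \frac{1}{2} x^{\mathsf T} \nabla^2 f(a) x \right) + \text{h.o.t.},
\]
where $\nabla f(a) \in \mathbb{R}^n$ is the gradient of $f$ at $a$ with entries
\[
\nabla f(a) = \left( \frac{\partial}{\partial x_1} f(a), \ldots, \frac{\partial}{\partial x_n} f(a) \right)^{\mathsf T},
\]
and where $\nabla^2 f(a) \in \mathbb{R}^{n \times n}$ is the Hessian matrix of $f$ at $a$ with entries
\[
[\nabla^2 f(a)]_{ij} = \frac{\partial^2}{\partial x_i \partial x_j} f(a),
\]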
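and where h.o.t. stands for “higher order terms”. Since the function is strictly convex, the Hessian matrix is positive semidefinite; throughout we assume that it is even positive definite, $\nabla^2 f(a) \in \mathcal{S}^n_{\succ 0}$ (strict convexity alone does not guarantee this, as $z \mapsto z^4$ at $z = 0$ shows). By $q : \mathbb{R}^n \to \mathbb{R}$ we denote the quadratic function which we get by truncating the above Taylor approximation
\[
q(x) = f(a) + \nabla f(a)^{\mathsf T} x + \frac{1}{2} x^{\mathsf T} \nabla^2 f(a) x.
\]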
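This is a strictly convex quadratic function and so it has a unique minimizer $x^* \in \mathbb{R}^n$ which can be determined by setting the gradient of $q$ to zero:
\[
0 = \nabla q(x^*) = \left( \frac{\partial}{\partial x_1} q(x^*), \ldots, \frac{\partial}{\partial x_n} q(x^*) \right)^{\mathsf T} = \nabla f(a) + \nabla^2 f(a) x^*.
\]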
Hence, we find the unique minimizer $x^*$ of $q$ by solving a system of linear equations
\[
x^* = -\left( \nabla^2 f(a) \right)^{-1} \nabla f(a).
\]
Now Newton’s method is based on approximating the function $f$ locally at a starting point $a$ by the quadratic function $q$, finding the minimizer (the Newton direction) $x^*$ of the quadratic function, updating the starting point to $a + x^*$ and repeating this until the desired accuracy is reached:

    repeat
        $x^* \leftarrow -\left( \nabla^2 f(a) \right)^{-1} \nabla f(a)$
        $a \leftarrow a + x^*$
    until a stopping criterion is fulfilled.
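The loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reference implementation: the helper names `solve` and `newton`, the tolerance `eps`, the iteration cap `max_iter`, and the test function $f(x, y) = x^2 + y^2 + e^{x+y}$ are all chosen here for the example. The Newton direction is obtained by solving the linear system $\nabla^2 f(a)\, x^* = -\nabla f(a)$ rather than by forming the inverse, and the loop stops once the squared gradient norm is small.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def newton(grad, hess, a, eps=1e-20, max_iter=50):
    """Pure Newton iteration; stops when ||grad f(a)||^2 <= eps."""
    a = list(a)
    for _ in range(max_iter):
        g = grad(a)
        if sum(gi * gi for gi in g) <= eps:
            break
        step = solve(hess(a), [-gi for gi in g])  # Newton direction x*
        a = [ai + si for ai, si in zip(a, step)]  # a <- a + x*
    return a

# Hypothetical test function f(x, y) = x^2 + y^2 + exp(x + y).
def grad(p):
    e = math.exp(p[0] + p[1])
    return [2 * p[0] + e, 2 * p[1] + e]

def hess(p):
    e = math.exp(p[0] + p[1])
    return [[2 + e, e], [e, 2 + e]]

minimizer = newton(grad, hess, [0.0, 0.0])
print(minimizer)
```

For this particular function the Hessian is positive definite everywhere, so every linear system in the loop is solvable; for a general $f$ that has to be checked separately.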
The following facts about Newton’s method are important.
First the good news: If the starting point is close to the minimizer, then the Newton method converges quadratically (for instance the sequence $n \mapsto 1/10^{2^n}$ converges quadratically to its limit $0$), i.e. in every step the number of accurate digits is multiplied by a constant factor (it roughly doubles).
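The digit-doubling behaviour can be observed numerically. The following sketch is an illustration only; the function $f(z) = e^z - 2z$, the starting point $1$ and the name `newton_errors` are assumptions made for this example. The minimizer of this $f$ is $\ln 2$, so the error of every iterate can be computed directly.

```python
import math

def newton_errors(z, steps):
    """Errors |z_k - ln 2| of Newton's method applied to f(z) = exp(z) - 2z."""
    target = math.log(2.0)  # exact minimizer, since f'(z) = exp(z) - 2
    errors = [abs(z - target)]
    for _ in range(steps):
        z = z - (math.exp(z) - 2.0) / math.exp(z)  # z - f'(z) / f''(z)
        errors.append(abs(z - target))
    return errors

errors = newton_errors(1.0, 4)
for k, e in enumerate(errors):
    print(k, e)
```

In this run each error is bounded by the square of the previous one, which is what quadratic convergence means in practice.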
However, if the starting point is not close to the minimizer or if the function is close to being not strictly convex, then Newton’s method does not converge well. Consider for example the univariate function $f(z) = \frac{1}{4} z^4 - z$, which is convex but whose second derivative vanishes at $z = 0$. Then $f'(z) = z^3 - 1$ and $f''(z) = 3z^2$. So if one starts the Newton iteration at $a = 0$, one immediately is in trouble: division by zero. If one starts at $a = -\sqrt[3]{1/2}$, then one can perform a Newton step, which lands at $0$, and one is in trouble again, etc. Figure 4.1 shows the fractal structure which is behind Newton’s method for solving the equation $f'(z) = z^3 - 1 = 0$ in the complex number plane. One has similar figures for other functions.
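The breakdown just described can be reproduced with a short sketch. The name `newton_step` is introduced here only for illustration. In floating point arithmetic the step from $-\sqrt[3]{1/2}$ lands at $0$ only up to rounding, so the code prints the iterate instead of assuming an exact value.

```python
def newton_step(z):
    """One Newton update z - f'(z) / f''(z) for f(z) = z**4 / 4 - z."""
    return z - (z**3 - 1) / (3 * z**2)

a0 = -(0.5 ** (1 / 3))  # starting point -(1/2)^(1/3)
a1 = newton_step(a0)    # equals 0 in exact arithmetic
print(a1)

try:
    newton_step(0.0)    # f''(0) = 0, so the step is undefined
except ZeroDivisionError:
    print("Newton step undefined at z = 0")
```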
Figure 4.1: Newton fractal of $z^3 - 1 = 0$. The three colors indicate the region of attraction for the three roots. The shade of the color indicates the number of steps needed to come close to the corresponding root. (Source: Wikipedia).

This pure Newton method is an idealization and sometimes it cannot be performed at all because it can very well happen that $a + x^* \notin \Omega$. One can circumvent these problems by replacing the Newton step $a \leftarrow a + x^*$ by a damped Newton step $a \leftarrow a + \theta x^*$ with some step size $\theta > 0$ which is chosen to ensure e.g. that $a + \theta x^* \in \Omega$. Choosing the right $\theta$ using a line search can be done in many ways. A popular choice is backtracking line search using the Armijo-Goldstein condition.
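A damped Newton iteration with backtracking might look as follows. This is a minimal sketch under assumptions: the example function $f(z) = z - \ln z$ on $\Omega = (0, \infty)$, the sufficient-decrease parameter `c`, the shrink factor `beta` and the tolerance `eps` are all chosen here for illustration. The step size is halved until the new point lies in $\Omega$ and satisfies the sufficient-decrease (Armijo) condition $f(a + \theta x^*) \le f(a) + c\,\theta\, f'(a)\, x^*$.

```python
import math

def f(z):
    return z - math.log(z)  # strictly convex on (0, inf), minimizer z = 1

def df(z):
    return 1 - 1 / z

def d2f(z):
    return 1 / z**2

def damped_newton(a, eps=1e-16, c=1e-4, beta=0.5, max_iter=100):
    for _ in range(max_iter):
        g = df(a)
        if g * g <= eps:
            break
        step = -g / d2f(a)  # Newton direction x*
        theta = 1.0
        # Backtrack until a + theta * step is in Omega and Armijo holds.
        while (a + theta * step <= 0
               or f(a + theta * step) > f(a) + c * theta * g * step):
            theta *= beta
        a += theta * step
    return a

pure_step_target = 3.0 - df(3.0) / d2f(3.0)  # pure Newton step from a = 3
result = damped_newton(3.0)
print(pure_step_target, result)
```

Starting from $a = 3$ the pure Newton step would move to a negative number, which lies outside $\Omega$, whereas the damped iteration stays in $\Omega$ and approaches the minimizer $1$.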
Let us discuss stopping criteria a bit: One possible stopping criterion is for example that the norm of the gradient is small, i.e. for some predefined positive $\varepsilon$ we do the iteration until
\[
\|\nabla f(a)\|^2 \le \varepsilon. \tag{4.1}
\]
We now derive a stopping criterion in the case when the function $f$ is not only strictly convex but also strongly convex. This means that there is a positive constant $m$ so that the smallest eigenvalue of all Hessian matrices of $f$ is at least $m$:
\[
\forall a \in \Omega : \lambda_{\min}(\nabla^2 f(a)) \ge m.
\]
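By the Lagrange form of the Taylor expansion we have
\[
\forall a, a + x \in \Omega \;\; \exists \xi \in [a, a + x] : \; f(a + x) = f(a) + \nabla f(a)^{\mathsf T} x + \frac{1}{2} x^{\mathsf T} \nabla^2 f(\xi) x
\]
and the strong convexity of $f$ together with the variational characterization of the smallest eigenvalue, which says that
\[
\lambda_{\min}(\nabla^2 f(\xi)) = \min_{x \in \mathbb{R}^n \setminus \{0\}} \frac{x^{\mathsf T} \nabla^2 f(\xi) x}{\|x\|^2},
\]
gives
\[
f(a + x) \ge f(a) + \nabla f(a)^{\mathsf T} x + \frac{1}{2} m \|x\|^2.
\]
Consider the function of the right hand side
\[
x \mapsto f(a) + \nabla f(a)^{\mathsf T} x + \frac{1}{2} m \|x\|^2.
\]
It is a convex quadratic function with gradient
\[
x \mapsto \nabla f(a) + m x,
\]
hence its minimum is attained at
\[
x^* = -\frac{1}{m} \nabla f(a).
\]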
So we have for the minimum $\mu^*$ of $f$
\[
\mu^* \ge f(a) + \nabla f(a)^{\mathsf T} \left( -\frac{1}{m} \nabla f(a) \right) + \frac{1}{2} m \left\| \frac{1}{m} \nabla f(a) \right\|^2 = f(a) - \frac{1}{2m} \|\nabla f(a)\|^2,
\]
which says that whenever the stopping criterion (4.1) is fulfilled we know that $f(a)$ and $\mu^*$ are at most $\varepsilon/(2m)$ apart. Of course, the drawback of this consideration is that one has to know or estimate the constant $m$ in advance which is often not easy. Nevertheless the consideration at least shows that the stopping criterion is sensible.
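The bound $f(a) - \mu^* \le \frac{1}{2m} \|\nabla f(a)\|^2$ can be checked numerically on a small example. The function below is an assumption made for illustration, not one used elsewhere in the text: $f(z) = z^2 + \cosh z - 1$ has $f''(z) = 2 + \cosh z \ge 3$, so it is strongly convex with $m = 3$, and its minimum $\mu^* = 0$ is attained at $z = 0$.

```python
import math

M_CONST = 3.0  # strong convexity constant: f''(z) = 2 + cosh(z) >= 3

def f(z):
    return z * z + math.cosh(z) - 1.0

def df(z):
    return 2.0 * z + math.sinh(z)

mu_star = f(0.0)  # minimum value, attained at z = 0 where f'(0) = 0

def gap_and_bound(a):
    """Return (f(a) - mu*, |f'(a)|^2 / (2m)) for the point a."""
    return f(a) - mu_star, df(a) ** 2 / (2.0 * M_CONST)

checks = [gap_and_bound(a) for a in (-2.0, -0.5, 0.3, 1.0, 3.0)]
for gap, bound in checks:
    print(gap, bound, gap <= bound)
```

At each of the sample points the gap to the minimum stays below the bound, as the derivation predicts.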
Equality-constrained minimization
In the next step we show how to modify Newton’s method if we want to find the minimum of a strictly convex, differentiable function $f : \Omega \to \mathbb{R}$ in an affine subspace given by the equations
\[
a_1^{\mathsf T} x = b_1, \; a_2^{\mathsf T} x = b_2, \; \ldots, \; a_m^{\mathsf T} x = b_m,
\]
where $a_1, \ldots, a_m \in \mathbb{R}^n$ and $b_1, \ldots, b_m \in \mathbb{R}$.
We define the Lagrange function
\[
L(x, \lambda_1, \ldots, \lambda_m) = f(x) + \sum_{i=1}^{m} \lambda_i a_i^{\mathsf T} x,
\]
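and the method of Lagrange multipliers says that if a point $y^*$ lies in the affine space
\[
a_1^{\mathsf T} y^* = b_1, \; \ldots, \; a_m^{\mathsf T} y^* = b_m,
\]
then it is the unique minimizer of $f$ on this affine space if and only if there are multipliers $\lambda_1, \ldots, \lambda_m \in \mathbb{R}$ with
\[
\nabla_x L(y^*, \lambda_1, \ldots, \lambda_m) = 0,
\]
where the gradient is taken with respect to $x$.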
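To find this point $y^*$ we approximate the function $f$ using the Taylor approximation around the point $a$ by
\[
q(x) = f(a) + \nabla f(a)^{\mathsf T} x + \frac{1}{2} x^{\mathsf T} \nabla^2 f(a) x
\]
and solve the linear system (in the variables $x^*$ and $\lambda_1, \ldots, \lambda_m$)
\[
\begin{gathered}
a_1^{\mathsf T} (a + x^*) = b_1, \; \ldots, \; a_m^{\mathsf T} (a + x^*) = b_m, \\
\nabla f(a) + \nabla^2 f(a) x^* + \sum_{i=1}^{m} \lambda_i a_i = 0
\end{gathered}
\]
to find the Newton direction $x^*$. Then we can do the same Newton iterations using damped Newton steps as in the case of unconstrained optimization.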
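Writing $A$ for the matrix with rows $a_i^{\mathsf T}$ and $b = (b_1, \ldots, b_m)^{\mathsf T}$, the linear system for the Newton direction has the block form
\[
\begin{pmatrix} \nabla^2 f(a) & A^{\mathsf T} \\ A & 0 \end{pmatrix}
\begin{pmatrix} x^* \\ \lambda \end{pmatrix}
=
\begin{pmatrix} -\nabla f(a) \\ b - A a \end{pmatrix}.
\]
The following sketch assembles and solves this system for one step. It is an illustration under assumptions: the names `solve` and `constrained_newton_step` and the example problem (minimize $x^2 + 2y^2$ subject to $x + y = 1$, starting from $a = (0, 0)$) are chosen here and do not come from the text. Because the example objective is quadratic, a single step already reaches the constrained minimizer.

```python
def solve(M, rhs):
    """Solve M x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    T = [list(M[i]) + [rhs[i]] for i in range(n)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(T[r][col]))
        T[col], T[pivot] = T[pivot], T[col]
        for r in range(col + 1, n):
            factor = T[r][col] / T[col][col]
            for c in range(col, n + 1):
                T[r][c] -= factor * T[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(T[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (T[i][n] - s) / T[i][i]
    return x

def constrained_newton_step(g, H, A, b, a):
    """Solve H x + A^T lam = -g and A (a + x) = b for (x, lam)."""
    n, m = len(a), len(b)
    # Upper block rows [H | A^T], lower block rows [A | 0].
    K = [list(H[i]) + [A[k][i] for k in range(m)] for i in range(n)]
    K += [list(A[k]) + [0.0] * m for k in range(m)]
    residual = [b[k] - sum(A[k][j] * a[j] for j in range(n)) for k in range(m)]
    sol = solve(K, [-gi for gi in g] + residual)
    return sol[:n], sol[n:]

# Hypothetical example: f(x, y) = x^2 + 2 y^2 with the constraint x + y = 1.
a = [0.0, 0.0]
g = [2.0 * a[0], 4.0 * a[1]]       # gradient of f at a
H = [[2.0, 0.0], [0.0, 4.0]]       # Hessian of f (constant here)
step, lam = constrained_newton_step(g, H, [[1.0, 1.0]], [1.0], a)
print(step, lam)
```

The starting point in this example does not satisfy the constraint; the second block row of the system takes care of that, since it forces $a + x^*$ into the affine space.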