(1)

Short course: Optimality Conditions and Algorithms in Nonlinear Optimization

Part I - Introduction to nonlinear optimization

Gabriel Haeser

Department of Applied Mathematics
Institute of Mathematics and Statistics
University of São Paulo
São Paulo, SP, Brazil

Santiago de Compostela, Spain, October 28-31, 2014

(2)

Outline

Part I - Introduction to nonlinear optimization
Examples and historical notes
First and second order optimality conditions
Penalty methods
Interior point methods

Part II - Optimality Conditions
Algorithmic proof of the Karush-Kuhn-Tucker conditions
Sequential Optimality Conditions
Algorithmic discussion

Part III - Constraint Qualifications
Geometric Interpretation
First and second order constraint qualifications

Part IV - Algorithms
Augmented Lagrangian methods
Inexact Restoration algorithms
Dual methods

www.ime.usp.br/∼ghaeser

(3)
(4)

Optimization

Optimization is a mathematical problem with many “real world” applications. The goal is to find minimizers or maximizers of a multivariable real function over a restricted domain.

Examples:

to draw a map of America with areas proportional to the real areas;

hard-spheres problem: to place m points on an n-dimensional sphere in such a way that the smallest distance between two points is maximized.

(5)

Problem America

To draw a map of America, similar to the usual map, with areas proportional to real areas.

Minimize (1/2) Σ_{i=1}^{m} ‖pi − p̄i‖²,
Subject to (1/2) Σ_{i=1}^{nj} (pi^x p(i+1)^y − p(i+1)^x pi^y) = βj, j = 1, …, c

c = 17 countries
βj is the real area of country j
m = 132 given points p̄i on the frontiers of the usual map

Green-Gauss formula to compute areas

(6)

Problem America

United States (without Alaska and Hawaii) = 8,080,464 km²
Brazil = 8,514,876 km²
Usual map ratio ≈ 1.32
Real ratio ≈ 0.95

Figures: usual map vs. areas proportional to real areas

(7)

Problem America

Figures: areas proportional to GDP vs. areas proportional to population

(8)

Kissing and hard-spheres problems

The kissing number of dimension n is the largest number of unit spheres that may be put touching an n-dimensional unit sphere without overlapping.

The hard-spheres problem consists of maximizing the smallest distance d between m points on the n-dimensional sphere of radius 2.

n    Kissing number
2    6
3    12
4    24
5    40-44
6    72-78
7    126-134
8    240
9    306-364
10   500-554

d ≥ 2 ⇒ kissing number ≥ m

n = 2, m = 6, d = 2
n = 3, m = 12, d ≈ 2.194
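As a quick sanity check of the n = 2, m = 6 row above, the following sketch (an illustration added here, not part of the original slides) places 6 points evenly on the circle of radius 2 and confirms that the smallest pairwise distance is d = 2, i.e. a regular hexagon attains the optimum:

```python
import math

# Hard-spheres check for n = 2, m = 6: six points evenly spaced on the
# circle of radius 2; the optimal smallest distance is d = 2.
def min_pairwise_distance(points):
    return min(
        math.dist(p, q)
        for i, p in enumerate(points)
        for q in points[i + 1:]
    )

m, radius = 6, 2.0
hexagon = [
    (radius * math.cos(2 * math.pi * i / m),
     radius * math.sin(2 * math.pi * i / m))
    for i in range(m)
]
d = min_pairwise_distance(hexagon)
print(round(d, 6))  # 2.0: each side of the hexagon equals the radius
```

For n = 3, m = 12 the analogous optimal configuration (the icosahedron) gives d ≈ 2.194, as stated on the slide.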

(9)

Applications: Packing

(10)

Applications: Packing

Initial configuration for molecular dynamics


(11)

Large scale problems: Finance

Jacek Gondzio and Andreas Grothey (May 2005): a convex quadratic program with 353 million constraints and 1,010 million variables.

Tool: Interior Point Method

(12)

Large scale problems: Localization

Find a point in the rectangle but not in the ellipse such that the sum of the distances to the polygons is minimized.

1,567,804 polygons.
3,135,608 variables.
1,567,804 upper-level constraints.
12,833,106 lower-level constraints.

Convergence in 10 outer iterations, 56 inner iterations, 133 function evaluations, 185 seconds.

Tool: Augmented Lagrangian method

(13)

TANGO Project - www.ime.usp.br/∼egbirgin/tango

Trustable Algorithms for Nonlinear General Optimization

(14)

TANGO Project - www.ime.usp.br/∼egbirgin/tango

40,370 visits registered by Google Analytics since 2007 (more than 3,000 downloads)

USA: 7,969; Brazil: 7,230; Germany: 2,974

(15)

TANGO Project - www.ime.usp.br/∼egbirgin/tango

Spain: 733

(16)

Historical Notes

Military programs formulated as a system of linear inequalities gave rise to the term Programming in a linear structure (title of the first paper by G. Dantzig, 1948).

Koopmans shortened the term to Linear Programming.

Dorfman (in 1949) thought that Linear Programming was too restrictive and suggested the more general term Mathematical Programming, now called Mathematical Optimization.

Nonlinear Programming is the title of the 1951 paper by Kuhn and Tucker that deals with optimality conditions.

These results extend the Lagrange rule of multipliers (1813) to the case of equality and inequality constraints. They were previously considered in the 1939 unpublished master's thesis of Karush (KKT conditions).

These works are particularly important because they suggest the development of algorithms to deal with practical problems.

(17)

Historical Notes

Linear Programming is part of a revolutionary development that gave humanity the capability to formulate an objective and to determine a detailed sequence of decisions to reach this goal in the best way possible.

Tools: models, algorithms, computers and software.

The impossibility of performing large computations is, according to Dantzig, the main reason for the lack of interest in optimization before 1947.

Important topics in computing: (a) dealing with sparsity allows for solving larger problems; (b) global optimization; (c) automatic differentiation of a function represented in a programming language.

(18)

Automatic Differentiation

f(x1, x2) = sin(x1) + x1 x2
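Forward-mode automatic differentiation can be sketched with dual numbers; the code below is a minimal illustration added to these notes (not the tool behind the slide's figure). Each value carries a pair (val, dot), where dot is the derivative with respect to a chosen input, and the chain and product rules propagate the pair through f(x1, x2) = sin(x1) + x1 x2:

```python
import math

# Minimal forward-mode AD with dual numbers: (val, dot) pairs, where
# dot is the derivative w.r.t. the input seeded with dot = 1.
class Dual:
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)
    def __mul__(self, other):  # product rule
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def sin(u):  # chain rule for sin
    return Dual(math.sin(u.val), math.cos(u.val) * u.dot)

def f(x1, x2):  # f(x1, x2) = sin(x1) + x1*x2, as on the slide
    return sin(x1) + x1 * x2

# Seed dot = 1 on the variable we differentiate with respect to.
x1, x2 = 1.5, 2.0
df_dx1 = f(Dual(x1, 1.0), Dual(x2, 0.0)).dot  # equals cos(x1) + x2
df_dx2 = f(Dual(x1, 0.0), Dual(x2, 1.0)).dot  # equals x1
```

One forward pass per input variable yields exact derivatives (up to floating point), with no symbolic manipulation and no finite-difference truncation error.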


(19)

Duality

Game theory and linear programming:

1948 - G. Dantzig visited John von Neumann in Princeton.

J. von Neumann, 1963. Discussion of a maximum problem.

D. Gale, H. W. Kuhn, A. W. Tucker, 1951. Linear programming and the theory of games.

Elements of duality:

a pair of optimization problems, one a maximum problem with objective function f and the other a minimum problem with objective function h, based on the same data

for feasible solutions to the pair of problems, always h ≥ f

a necessary and sufficient condition for optimality is h = f

(20)

Duality

(Fermat, XVII century): Given 3 points p1, p2 and p3 on the plane, find the point x that minimizes the sum of the distances from x to p1, p2 and p3.

(21)

Duality

(Thomas Moss, The Ladies Diary, 1755): “In the three sides of an equiangular field stand three trees, at the distances of 10, 12 and 16 chains from one another: to find the content of the field, it being the greatest the data will admit.”

(22)

Duality

(J.D. Gergonne (ed.), Annales de Mathématiques Pures et Appliquées, 1810-1811): Given any triangle, circumscribe the largest possible equilateral triangle about it.

Solution given in the 1811-1812 edition by Rochat, Vecten, Fauguier and Pilatte, where duality was acknowledged.

(23)

The problem (NLP)

Minimize f(x),
Subject to hi(x) = 0, i = 1, …, m,
gj(x) ≤ 0, j = 1, …, p.

f, hi, gj : Rn → R are (twice) continuously differentiable functions.

Ω = {x ∈ Rn | h(x) = 0, g(x) ≤ 0} (feasible set)

(24)

Solution

Global solution: a feasible point x* ∈ Ω is a global minimizer of NLP when

f(x*) ≤ f(x), ∀x ∈ Ω

Local solution: a feasible point x* ∈ Ω is a local minimizer of NLP when there exists a neighbourhood B(x*, ε) of x* such that

f(x*) ≤ f(x), ∀x ∈ Ω ∩ B(x*, ε)

A(x) = {j ∈ {1, …, p} | gj(x) = 0} (set of active inequalities at x ∈ Ω)

(25)

Example

Minimize x² + y²,
Subject to x + y − 1 = 0.

(26)

First order optimality condition - Lagrange multipliers

Minimize x² + y²,
Subject to x + y − 1 = 0.

At x* = (1/2, 1/2):

(1, 1)ᵀ + (−1)(1, 1)ᵀ = 0

i.e. ∇f(x*) + λ∇h(x*) = 0 with λ = −1.

(27)

Example

Maximize x² + y²,
Subject to x + 2y − 2 ≤ 0,
x ≥ 0,
y ≥ 0.

(28)

Minimize −x² − y²,
Subject to x + 2y − 2 ≤ 0,
−x ≤ 0,
−y ≤ 0.

At x* = (2, 0): (−4, 0)ᵀ + 4(1, 2)ᵀ + 8(0, −1)ᵀ = 0

At x* = (0, 1): (0, −2)ᵀ + 1(1, 2)ᵀ + 1(−1, 0)ᵀ = 0

At x* = (0.4, 0.8): (−0.8, −1.6)ᵀ + 0.8(1, 2)ᵀ = 0

(29)

First order optimality condition - KKT condition

(Karush-Kuhn-Tucker) Under some condition (constraint qualification), if x* is a local solution, there exist Lagrange multipliers λ ∈ Rm and µ ∈ Rp such that:

∇f(x*) + Σ_{i=1}^{m} λi ∇hi(x*) + Σ_{j=1}^{p} µj ∇gj(x*) = 0, (Lagrange condition)

µj gj(x*) = 0, j = 1, …, p, (complementarity)

h(x*) = 0, g(x*) ≤ 0, (feasibility)

µ ≥ 0. (dual feasibility)

Interpretation: up to first order, a feasible direction cannot be a descent direction.
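The Lagrange (stationarity) part of the KKT condition can be verified numerically. The sketch below (added for illustration) checks it for the earlier example, Minimize −x² − y² subject to x + 2y − 2 ≤ 0, −x ≤ 0, −y ≤ 0, at the point (2, 0) with the multipliers µ = (4, 0, 8) from the slides:

```python
# Numerical check of the Lagrange condition grad f + sum mu_j grad g_j = 0
# for: Minimize -x^2 - y^2  s.t.  x + 2y - 2 <= 0, -x <= 0, -y <= 0.
def grad_f(x, y):
    return (-2 * x, -2 * y)

# Gradients of g1, g2, g3 (constant, since the constraints are linear).
grad_g = [(1, 2), (-1, 0), (0, -1)]

def kkt_residual(point, mu):
    gx, gy = grad_f(*point)
    for mj, (a, b) in zip(mu, grad_g):
        gx += mj * a
        gy += mj * b
    return (gx, gy)

print(kkt_residual((2, 0), (4, 0, 8)))  # (0, 0): stationarity holds
```

The same check succeeds at (0, 1) with µ = (1, 1, 0) and at (0.4, 0.8) with µ = (0.8, 0, 0), matching the three candidate points of the example.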

(30)

Second order optimality condition

x* = (0.4, 0.8), ∇g1(x*) = (1, 2)ᵀ, ∇²f(x*) = [−2 0; 0 −2].

There exists some d ∈ Rn with ∇g1(x*)ᵀd ≤ 0 and dᵀ∇²f(x*)d < 0.

Theorem: Under some conditions, if x* is a local minimizer, then

dᵀ (∇²f(x*) + Σ_{i=1}^{m} λi ∇²hi(x*) + Σ_{j=1}^{p} µj ∇²gj(x*)) d ≥ 0,

for every d ∈ Rn such that

∇f(x*)ᵀd ≤ 0,
∇hi(x*)ᵀd = 0, i = 1, …, m,
∇gj(x*)ᵀd ≤ 0, j ∈ A(x*).

Interpretation: All critical directions must be of ascent nature.

(31)

History of nonlinear programming

Kuhn, Tucker, 1951.

Nonlinear programming.

Albert William Tucker (1905 - 1995)

Princeton University Topology

Harold William Kuhn (1925 - 2014) Princeton University PhD 1950, Algebra

Game Theory, Optimization

Saddle point problem

φ(x*, u) ≤ φ(x*, u*) ≤ φ(x, u*), ∀x, u

(32)

History of nonlinear programming

William Karush (1917-1997)

1939. Minima of Functions of Several Variables with Inequalities as Side Conditions.

M.Sc. thesis, Department of Mathematics, University of Chicago

Calculus of Variations and Optimization

University of Chicago and California State University (also Manhattan Project)

“I concluded that you two had exploited and developed the subject so much further than I, that there was no justification for my announcing to the world, ‘Look what I did, first.’” (1975)

(33)

History of nonlinear programming

Fritz John (1910 - 1994)

1948. Extremum problems with inequalities as subsidiary conditions.

PhD 1933 in Göttingen under Courant
New York University

Partial differential equations, convex geometry, nonlinear elasticity

(34)

History of nonlinear programming

Fritz John (1910 - 1994)

Let S be a bounded set in Rm. Find the sphere of least positive radius enclosing S.

Minimize F(x) := x_{m+1},
Subject to G(x, y) := x_{m+1} − Σ_{i=1}^{m} (xi − yi)² ≥ 0 for all y ∈ S.

The boundary of a compact convex set S in Rn lies between two homothetic ellipsoids of ratio ≤ n, and the outer ellipsoid can be taken to be the ellipsoid of least volume containing S.

(35)

Snell’s law of refraction

sin θy / vy = sin θz / vz

(36)

Snell’s law of refraction

sin θy / vy = sin θz / vz

Minimize T(x) := ‖x − y‖/vy + ‖x − z‖/vz,
Subject to h(x) = 0.

At the solution x*, ∇T(x*) = (x* − y)/(vy‖y − x*‖) + (x* − z)/(vz‖z − x*‖) is parallel to ∇h(x*), the normal vector to the surface.

Define ȳ = x* + (y − x*)/(vy‖y − x*‖) and z̄ = x* + (z − x*)/(vz‖z − x*‖).

Hence −∇T(x*) = (ȳ − x*) + (z̄ − x*) is the diagonal of the following parallelogram:

(37)

Snell’s law of refraction

sin θy / vy = sin θz / vz

By triangle similarity, ȳ and z̄ are equally far from the normal line.

Hence ‖ȳ − x*‖ sin θy = ‖z̄ − x*‖ sin θz. The calculation ‖ȳ − x*‖ = 1/vy and ‖z̄ − x*‖ = 1/vz yields Snell’s law.
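The same law can be recovered numerically from Fermat's principle. The sketch below (a hypothetical setup added to these notes) sends light from y = (0, 1) at speed vy to z = (3, −1) at speed vz across a flat interface on the x-axis, minimizes the travel time T over the crossing point (u, 0), and checks Snell's equality at the minimizer:

```python
import math

# Fermat's principle: minimize travel time across a flat interface and
# verify sin(theta_y)/vy = sin(theta_z)/vz at the minimizer.
vy, vz = 1.0, 1.5          # speeds in the two media (assumed values)
d, hy, hz = 3.0, 1.0, 1.0  # horizontal offset and heights of y and z

def travel_time(u):
    return math.hypot(u, hy) / vy + math.hypot(d - u, hz) / vz

# Ternary search: travel_time is strictly convex in u on [0, d].
lo, hi = 0.0, d
for _ in range(200):
    m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
    if travel_time(m1) < travel_time(m2):
        hi = m2
    else:
        lo = m1
u = (lo + hi) / 2

sin_y = u / math.hypot(u, hy)            # sine of incidence angle
sin_z = (d - u) / math.hypot(d - u, hz)  # sine of refraction angle
print(abs(sin_y / vy - sin_z / vz) < 1e-9)  # True
```

Since vz > vy here, the ray bends away from the normal in the lower medium, as expected.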

(38)

External Penalty Method

Choose a sequence {ρk} with ρk → +∞ and for each k solve the problem

Minimize f(x) + ρk P(x),

obtaining the (global) solution xk, if it exists.

P is a smooth function
P(x) ≥ 0
P(x) = 0 ⇔ h(x) = 0, g(x) ≤ 0

For example: P(x) = ‖h(x)‖₂² + ‖max{0, g(x)}‖₂²

(39)

External Penalty Method

Theorem: If {xk} is well defined, then every limit point of {xk} is a global solution of Minimize P(x).

Theorem: If {xk} is well defined and there exists a point where the function P vanishes (the feasible region is not empty), then every limit point of {xk} is a global solution of

Minimize f(x), Subject to h(x) = 0, g(x) ≤ 0.

The External Penalty Method can be used as a theoretical tool to prove the KKT conditions, but it can also be adjusted into an efficient algorithm (the Augmented Lagrangian method).

(40)

External Penalty Method

Minimize x1² + x2²,
Subject to x1 − 1 = 0,
x2 − 1 ≤ 0.

Minimize x1² + x2² + ρk((x1 − 1)² + max{0, x2 − 1}²) (= Φk(x)).

Solving ∇Φk(x) = 0 we get xk = (ρk/(1 + ρk), 0) → (1, 0).

Show simulation
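The simulation can be sketched in a few lines. The code below (a minimal illustration, using plain gradient descent as the inner solver, which is an assumption of this sketch) minimizes Φk for growing ρk and shows the iterates approaching the solution (1, 0):

```python
# External (quadratic) penalty method on the example:
#   Minimize x1^2 + x2^2  s.t.  x1 - 1 = 0,  x2 - 1 <= 0.
# Gradient of Phi_k(x) = x1^2 + x2^2 + rho*((x1-1)^2 + max(0, x2-1)^2):
def grad_phi(x1, x2, rho):
    g1 = 2 * x1 + 2 * rho * (x1 - 1)
    g2 = 2 * x2 + 2 * rho * max(0.0, x2 - 1)
    return g1, g2

x = (2.0, 2.0)  # arbitrary starting point
for rho in (1.0, 10.0, 100.0, 1000.0):
    step = 1.0 / (2 + 2 * rho)  # safe stepsize for this quadratic
    for _ in range(5000):
        g1, g2 = grad_phi(x[0], x[1], rho)
        x = (x[0] - step * g1, x[1] - step * g2)
    print(rho, round(x[0], 4), round(x[1], 4))
# x1 tends to rho/(1 + rho) -> 1 and x2 -> 0, matching the closed form.
```

Note that each subproblem minimizer is infeasible (x1 < 1); feasibility is only attained in the limit ρk → +∞, which is characteristic of external penalty methods.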


(41)

Internal Penalty Method

Choose a sequence {µk} with µk → 0+ and for each k solve the problem

Minimize f(x) + µk B(x),
Subject to h(x) = 0,
g(x) < 0.

B is smooth
B(x) ≥ 0
B(x) → +∞ if some gi(x) → 0 with g(x) < 0

For example: B(x) = −Σi log(−gi(x))
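A one-dimensional example makes the barrier mechanism concrete. The sketch below (an illustration added to these notes) takes Minimize x subject to −x ≤ 0, whose solution x* = 0 lies on the boundary; the barrier subproblem min x − µ log(x) over x > 0 has stationary point x(µ) = µ, so the barrier minimizers approach x* from the interior as µ → 0+:

```python
# Internal (log-barrier) penalty on:  Minimize x  s.t.  -x <= 0.
# Barrier subproblem: min_{x > 0}  phi(x) = x - mu*log(x),
# with phi'(x) = 1 - mu/x and phi''(x) = mu/x^2; its minimizer is x = mu.
def barrier_minimizer(mu, iters=100):
    x = 1.0  # strictly feasible (interior) starting point
    for _ in range(iters):
        # Newton step on phi'(x) = 0: x_new = x - phi'(x)/phi''(x)
        x_new = x - (1 - mu / x) * (x * x / mu)
        x = x_new if x_new > 0 else x / 2  # safeguard: stay interior
    return x

for mu in (1.0, 0.1, 0.01, 0.001):
    print(mu, barrier_minimizer(mu))  # x(mu) = mu, tending to x* = 0
```

Unlike the external penalty, every iterate is strictly feasible; the boundary solution is reached only in the limit µ → 0+.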

(42)

Interior Point Method

Consider the convex quadratic problem

Minimize cᵀx + (1/2)xᵀQx,
Subject to Ax = b,
x ≥ 0,

and the barrier subproblem

Minimize cᵀx + (1/2)xᵀQx − µ Σ_{j=1}^{n} log xj,
Subject to Ax = b,
x > 0.

KKT condition:

c − Aᵀλ + Qx − µX⁻¹e = 0, Ax = b,

where X⁻¹ = diag{x1⁻¹, …, xn⁻¹} and e = (1, …, 1)ᵀ. Denoting s = µX⁻¹e we get

Aᵀλ + s − Qx = c, Ax = b, XSe = µe, (x, s) > 0.

(43)

Interior Point Method

Active-set methods:
Aᵀλ + s − Qx = c,
Ax = b,
XSe = 0,
(x, s) ≥ 0.

Interior point methods:
Aᵀλ + s − Qx = c,
Ax = b,
XSe = µe,
(x, s) > 0.

(44)

Interior Point Method

Complementarity: xi si = 0, ∀i = 1, …, n.

Active-set methods try to guess the optimal active subset A ⊆ {1, …, n} and set xi = 0 for i ∈ A (active constraints), si = 0 for i ∉ A (inactive constraints).

Interior point methods use ε-mathematics:

Replace xi si = 0, ∀i = 1, …, n, by xi si = µ, ∀i = 1, …, n.

Force convergence by letting µ → 0+.

(45)

Interior Point Method

Solve the nonlinear system of equations f(x, λ, s) = 0, where f : R^(2n+m) → R^(2n+m) is the mapping:

f(x, λ, s) = (Aᵀλ + s − Qx − c, Ax − b, XSe − µe).

(46)

Interior Point Method

Newton direction:

[ −Q  Aᵀ  I ] [ Δx ]   [ c − Aᵀλ − s + Qx ]
[  A   0  0 ] [ Δλ ] = [ b − Ax           ]
[  S   0  X ] [ Δs ]   [ µe − XSe         ]

Reduce µ at each Newton iteration.

(47)

Interior Point Method

Algorithm:

Step 0: Choose (x0, λ0, s0) with (x0, s0) > 0, µ0 > 0 and parameters 0 < γ < 1 and ε > 0. Set k = 0.

Step 1: Compute the Newton direction (Δx, Δλ, Δs) at (x, λ, s) := (xk, λk, sk).

Step 2: Choose a stepsize α such that (xk + αΔx, sk + αΔs) > 0.

Step 3: Update µk+1 = γµk.

Step 4: If xkᵀsk ≤ ε x0ᵀs0, stop. Else set k := k + 1 and go to Step 1.
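The steps above can be sketched end to end. The code below is a minimal illustration, not the implementation behind the lecture's experiments: it uses dense Gaussian elimination in place of a sparse solver, a fixed fraction-to-the-boundary factor of 0.9 in Step 2, a fixed iteration count in place of the stopping test, and a tiny assumed test problem (min x1² + x2² s.t. x1 + x2 = 2, x ≥ 0, whose solution is x = (1, 1)):

```python
# Primal-dual interior point sketch for the convex QP of the slides:
#   Minimize c^T x + (1/2) x^T Q x,  subject to  Ax = b, x >= 0.

def solve(M, rhs):
    """Solve M d = rhs by Gaussian elimination with partial pivoting."""
    n = len(M)
    M = [row[:] + [rhs[i]] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    d = [0.0] * n
    for r in range(n - 1, -1, -1):
        d[r] = (M[r][n] - sum(M[r][c] * d[c] for c in range(r + 1, n))) / M[r][r]
    return d

def ipm(Q, A, b, c, mu=1.0, gamma=0.5, iters=50):
    n, m = len(c), len(b)
    x, lam, s = [1.0] * n, [0.0] * m, [1.0] * n    # Step 0
    for _ in range(iters):
        # Step 1: assemble the Newton system [-Q A^T I; A 0 0; S 0 X].
        N = 2 * n + m
        M = [[0.0] * N for _ in range(N)]
        rhs = [0.0] * N
        for i in range(n):
            for j in range(n):
                M[i][j] = -Q[i][j]
            for k in range(m):
                M[i][n + k] = A[k][i]              # A^T block
            M[i][n + m + i] = 1.0                  # I block
            rhs[i] = (c[i] - sum(A[k][i] * lam[k] for k in range(m))
                      - s[i] + sum(Q[i][j] * x[j] for j in range(n)))
        for k in range(m):
            for j in range(n):
                M[n + k][j] = A[k][j]
            rhs[n + k] = b[k] - sum(A[k][j] * x[j] for j in range(n))
        for i in range(n):
            M[n + m + i][i] = s[i]                 # S block
            M[n + m + i][n + m + i] = x[i]         # X block
            rhs[n + m + i] = mu - x[i] * s[i]
        d = solve(M, rhs)
        dx, dlam, ds = d[:n], d[n:n + m], d[n + m:]
        # Step 2: damped step keeping (x, s) strictly positive.
        alpha = 1.0
        for i in range(n):
            if dx[i] < 0:
                alpha = min(alpha, -0.9 * x[i] / dx[i])
            if ds[i] < 0:
                alpha = min(alpha, -0.9 * s[i] / ds[i])
        x = [x[i] + alpha * dx[i] for i in range(n)]
        lam = [lam[k] + alpha * dlam[k] for k in range(m)]
        s = [s[i] + alpha * ds[i] for i in range(n)]
        mu *= gamma                                # Step 3
    return x

# Assumed test problem: min x1^2 + x2^2 s.t. x1 + x2 = 2, x >= 0,
# i.e. Q = 2I, c = 0; the solution is x = (1, 1).
xs = ipm([[2.0, 0.0], [0.0, 2.0]], [[1.0, 1.0]], [2.0], [0.0, 0.0])
print([round(v, 6) for v in xs])
```

All iterates keep (x, s) strictly positive, while XSe tracks the shrinking µe; this is the ε-mathematics of the previous slide in action.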

(48)

Interior Point Method

Consider the merit function

ψ(x, s) = (n + √n) log(xᵀs) − Σ_{i=1}^{n} log(xi si).

(Note that ψ(x, s) → −∞ ⇒ xᵀs → 0.)

Choosing the stepsize α that minimizes ψ(xk + αΔx, sk + αΔs) (exact line search) we get:

Theorem: If γ = n/(n + √n), we have xkᵀsk ≤ ε x0ᵀs0 in O(√n log(n/ε)) iterations.

(49)

Algorithms

There is no “direct method” to solve NLP.

NLP is solved using iterative methods.

An iterative method generates a sequence of points xk ∈ Rn that converges (or not) to a solution of the problem.

Iterative methods are programmed and implemented on computers, where real mathematical operations are replaced by floating point operations.

(50)

Algorithms

Theory is necessary to avoid performing an infinite number of experiments.

Useful theory should be able to predict the behavior of many experiments.

Usually, the theory does not refer to the real sequences generated by the computer, but to theoretical sequences defined by the algorithms.

The analogy between real sequences and theoretical sequences is not perfect.

There are practical phenomena that the theory is not able to predict, but relevant theory is the kind that contributes to explaining practical phenomena.
