COMPLEXITY FOR CONSTRAINED OPTIMIZATION
E. G. BIRGIN† AND J. M. MARTÍNEZ‡
Abstract. The main objective of this research is to introduce a practical method for smooth bound-constrained optimization that possesses worst-case evaluation complexity $O(\varepsilon^{-3/2})$ for finding an $\varepsilon$-approximate first-order stationary point when the Hessian of the objective function is Lipschitz continuous. As in other well-established algorithms for optimization with box constraints, the algorithm proceeds by visiting the different faces of the domain, aiming to reduce the norm of an internal projected gradient and abandoning active constraints when no additional progress is expected in the current face. The introduced method emerges as a particular case of a method for minimization with linear constraints. Moreover, the linearly constrained minimization algorithm is an instance of a minimization algorithm with general constraints whose implementation may be unaffordable when the constraints are complicated. As a procedure for leaving faces, we employ a different method that may be regarded as an independent device for constrained optimization. This independent algorithm may be employed to solve linearly constrained optimization problems on its own, without relying on the active-set strategy. A careful implementation and numerical experiments show that the algorithm that combines active sets with leaving-face iterations is more effective than the independent algorithm on which the leaving-face iterations are based, although both exhibit the same complexity $O(\varepsilon^{-3/2})$.
Key words. Nonlinear programming, bound-constrained minimization, active-set strategies, regularization, complexity.
AMS subject classifications. 90C30, 65K05, 49M37, 90C60, 68Q25.
1. Introduction. In this paper, we address the problem of minimizing a smooth and generally nonconvex function over a region of a finite-dimensional Euclidean space. First, we present a cubic regularization algorithm with cubic descent that finds an approximate first-order stationary point with arbitrary precision $\varepsilon$, or a boundary point of the feasible region, with evaluation complexity $O(\varepsilon^{-3/2})$. In addition, complexity results for finding second-order stationary points and convergence results are presented. Second, we introduce an algorithm for minimization on arbitrary and generally nonconvex regions defined by inequalities and equalities. The evaluation complexity of this algorithm for finding first-order stationary points is also $O(\varepsilon^{-3/2})$ when one uses cubic regularization of the quadratic approximation of the objective. The most general form, in which we use $p$th Taylor polynomials to define subproblems, has complexity $O(\varepsilon^{-(p+1)/p})$. A version of this algorithm for minimizing smooth functions with linear constraints will be introduced and implemented for the case $p = 2$. Moreover, the problem of minimizing with linear constraints will also be addressed in a different way. Namely, we consider each face of the polytope as a finite-dimensional region within which we may employ the algorithm first introduced in this paper. As mentioned above, this algorithm either finds an approximate stationary point within the face or finds a point on its boundary. Then, as a natural combination of the first two algorithms introduced so far, we use the first one for minimizing within faces and the second one for giving up constraints when the current face is exhausted.
∗Submitted to the editors April 24, 2017.
Funding: Supported by FAPESP (grants 2013/03447-6, 2013/05475-7, 2013/07375-0, and 2014/18711-3) and CNPq (grants 309517/2014-1 and 303750/2014-6).
†Dept. of Computer Science, Institute of Mathematics and Statistics, University of São Paulo, Rua do Matão, 1010, Cidade Universitária, 05508-090, São Paulo, SP, Brazil. egbirgin@ime.usp.br
‡Dept. of Applied Mathematics, Institute of Mathematics, Statistics, and Scientific Computing, State University of Campinas, 13083-859, Campinas, SP, Brazil. martinez@ime.unicamp.br
This gives rise to a third algorithm, with complexity $O(\varepsilon^{-3/2})$ for finding $\varepsilon$-approximate first-order stationary points, which will be implemented and compared with the second one. Therefore, the present paper introduces an algorithm with complexity $O(\varepsilon^{-3/2})$ for finding approximate KKT points of linearly constrained optimization problems, obtained by specializing a theoretical algorithm for constrained optimization. Theoretical algorithms for finding approximate KKT points of general nonlinear programming problems with complexity $O(\varepsilon^{-3/2})$ have been introduced [5], but practical counterparts are not yet known.

Several numerical optimization traditions converge in the present work. Algorithms that use quadratic models for unconstrained problems and employ regularized or trust-region subproblems, combined with actual-versus-predicted-reduction tests or cubic descent, with proven complexity $O(\varepsilon^{-3/2})$, were given in [8, 16, 17, 18, 20, 21, 28, 33]. Cubic regularization was introduced as a useful optimization tool in [26, 35]. The optimality of the complexity $O(\varepsilon^{-3/2})$ was proved in [18]. The generalization of cubic regularization to arbitrary $(p+1)$th-order regularization was given in [6]. In [25], the complexity (close to $O(\varepsilon^{-2})$) of quasi-Newton methods for unconstrained optimization was analyzed. The complexity of unconstrained optimization and of some constrained optimization algorithms under relaxed Lipschitz (Hölder) conditions on the objective function was analyzed in [19, 24, 27]. The idea of minimizing on polytopes using a fast algorithm within faces, combined with a suitable procedure to abandon exhausted faces, is at the core of active-set methods for constrained optimization and may be found in several textbooks [11, 22, 34]. Many algorithms for solving general constrained optimization problems use subproblems that consist of the minimization of combinations of objective and constraint functions subject to linear constraints [11, 22, 32, 34]. The combined algorithm presented here is inspired by the ideas of [3, 9, 10], where the algorithm for leaving faces was the spectral projected gradient method [1, 12, 13, 14, 15].
This paper is organized as follows. Section 2 introduces the cubic regularization method that either finds an interior stationary point or stops at the boundary of the feasible region; first- and second-order complexity results are presented. In Section 3, we introduce a model algorithm with $(p+1)$-regularized models for nonlinear programming. In Section 4, we describe the method for minimization with linear constraints that combines the procedures of Sections 2 and 3. In Section 5, we compare the combined algorithm against the method presented in Section 3 with $p = 2$. In Section 6, we state some conclusions and lines for future research.
Notation. Given $\Omega \subseteq \mathbb{R}^n$, $\mathrm{Int}(\Omega)$ denotes the set of interior points of $\Omega$ and $\bar\Omega$ denotes its closure; if $\Omega$ is convex, $P_\Omega(x)$ denotes the Euclidean projection of $x$ onto $\Omega$; $\nabla$ and $\nabla^2$ are the gradient and Hessian operators, respectively; $\|\cdot\|$ denotes the Euclidean norm; $\mathbb{N} = \{0, 1, 2, \ldots\}$; $g(x)$ denotes the gradient of $f(x)$ and $H(x)$ denotes its Hessian; $\lambda_1(H)$ denotes the leftmost eigenvalue of the symmetric matrix $H$; $H_k$ denotes $H(x^k)$; $h'(x)$ denotes the Jacobian of $h : \mathbb{R}^n \to \mathbb{R}^q$; and, if $a \in \mathbb{R}$, $[a]_+ = \max\{a, 0\}$.
2. Inner-to-the-face algorithm. Consider the problem

Minimize $f(x)$ subject to $x \in \Omega$.
We assume that $\Omega \subset \mathbb{R}^n$ has non-empty interior and that $f$ is a continuous function defined on an open set that contains $\Omega$. In this section, we introduce an algorithm that finds an interior point that approximately satisfies first- and second-order conditions for unconstrained minimization or, alternatively, finds a point on the boundary of $\Omega$. More precisely, either the algorithm generates a finite sequence $x^0, x^1, \ldots, x^{\hat k}$ such that the final point $x^{\hat k}$ is on the boundary of $\Omega$, $x^k \in \mathrm{Int}(\Omega)$ for all $k < \hat k$, and $f(x^{\hat k}) < f(x^{\hat k - 1}) < \cdots < f(x^0)$, or it generates an infinite sequence $\{x^k\}_{k=0}^\infty \subset \mathrm{Int}(\Omega)$ such that $\{f(x^k)\}$ is strictly decreasing, $g(x^k) \to 0$, and $[-\lambda_1(H(x^k))]_+ \to 0$ as $k \to \infty$.
In the algorithm, for a given iterate $x^k$ and a trial step $s_{\rm trial}$, we consider the interiority condition

(1) $x^k + s_{\rm trial} \in \mathrm{Int}(\Omega)$

and the sufficient descent condition

(2) $f(x^k + s_{\rm trial}) \le f(x^k) - \alpha \|s_{\rm trial}\|^3$.
The first step of the algorithm consists in computing, when possible, a solution $s_{\rm trial}$ to the quadratic model

(3) Minimize $g(x^k)^T s + \frac{1}{2} s^T H_k s$

and checking whether it satisfies (1) and (2) or not. Step 2, which is executed when the first step is not successful, consists in finding, by increasing the regularization parameter $\rho$ in a controlled way, a solution $s_{\rm trial} = s_{\rm trial}(\rho)$ to the cubic regularized model

(4) Minimize $g(x^k)^T s + \frac{1}{2} s^T H_k s + \rho \|s\|^3$

that satisfies (1) and (2). Step 3 consists in verifying whether the regularization parameter $\rho$ happened to be too large or not. Completing this verification may require the calculation of a point on the boundary of $\Omega$, and this task is performed at Step 4. At the end, the computed new iterate may be either a solution to (3) or (4) that belongs to the interior of $\Omega$ and satisfies the sufficient descent condition, or a point on the boundary of $\Omega$. In the latter case, the algorithm stops. The complete description of the algorithm follows.
Algorithm 2.1. Assume that $x^0 \in \mathrm{Int}(\Omega)$, $\alpha > 0$, $\tau_2 \ge \tau_1 > 1$, $M > \rho_{\min} > 0$, and $J \ge 0$ are given. Initialize $k \leftarrow 0$.

Step 1. Set $j \leftarrow 0$.
Step 1.1. If (3) is not solvable, go to Step 2.
Step 1.2. Define $\rho_{k,j} = 0$ and $s^{k,j}$ as a solution to (3).
Step 1.3. If (1) does not hold with $s_{\rm trial} = s^{k,j}$ and $j < J$, compute, when possible, a point $w$ on the boundary of $\Omega$. If $f(w) < f(x^k)$ then define $x^{k+1} = w$ and stop.
Step 1.4. If (1) and (2) hold with $s_{\rm trial} = s^{k,j}$, define $s^k = s^{k,j}$, $\rho_k = \rho_{k,j}$, and $x^{k+1} = x^k + s^k$, set $k \leftarrow k + 1$, and go to Step 1. Otherwise, set $j \leftarrow j + 1$ and continue at Step 2 below.
Step 2. Set $\bar\rho \leftarrow \rho_{\min}$.
Step 2.1. Define $s^{k,j}$ as a solution to (4) with $\rho = \rho_{k,j}$ for some $\rho_{k,j} \in [\tau_1 \bar\rho, \tau_2 \bar\rho]$.
Step 2.2. If (1) does not hold with $s_{\rm trial} = s^{k,j}$ and $j < J$, compute, when possible, a point $w$ on the boundary of $\Omega$. If $f(w) < f(x^k)$ then define $x^{k+1} = w$ and stop.
Step 2.3. If (1) or (2) does not hold with $s_{\rm trial} = s^{k,j}$, set $\bar\rho \leftarrow \rho_{k,j}$, $j \leftarrow j + 1$, and go to Step 2.1.
Step 3. If

(5) $\rho_{k,j} \le M$ or $j = 0$ or $x^k + s^{k,j-1} \in \mathrm{Int}(\Omega)$,

define $s^k = s^{k,j}$, $\rho_k = \rho_{k,j}$, and $x^{k+1} = x^k + s^k$, set $k \leftarrow k + 1$, and go to Step 1.
Step 4. Define $\sigma_1 = 3 \|s^{k,j-1}\| \rho_{k,j-1}$ and $\sigma_2 = 3 \|s^{k,j}\| \rho_{k,j}$.
Step 4.1. Compute $\sigma \in [\sigma_1, \sigma_2)$ and $w = x^k - (H_k + \sigma I)^{-1} g(x^k)$ such that $w$ is on the boundary of $\Omega$. (If $x^k + s^{k,j-1}$ is on the boundary of $\Omega$ then $\sigma = \sigma_1$ and $w = x^k + s^{k,j-1}$ is the natural choice.)
Step 4.2. If $f(w) < f(x^k)$ then define $x^{k+1} = w$ and stop. Otherwise, define $s^k = s^{k,j}$, $\rho_k = \rho_{k,j}$, and $x^{k+1} = x^k + s^k$ and go to Step 1.
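To make the control flow concrete, the following is a minimal Python sketch of the core of Algorithm 2.1 (Steps 1 and 2 only; the magical steps of Steps 1.3 and 2.2 and the boundary handling of Steps 3 and 4 are omitted). The helpers `solve_cubic`, which is supposed to return a global minimizer of the cubic model (4), and the membership oracle `interior` are assumptions of ours, not part of the paper.

```python
import numpy as np

def cubic_reg_step(x, f, g, H, solve_cubic, interior,
                   alpha=1e-8, tau1=2.0, tau2=3.0, rho_min=1e-8):
    """One simplified iteration of Algorithm 2.1: try the unregularized
    model (3); on failure, solve the cubic model (4) with increasing rho
    until the interiority condition (1) and cubic descent (2) hold."""
    fx, gx, Hx = f(x), g(x), H(x)

    def accepted(s):
        # conditions (1) and (2)
        return interior(x + s) and f(x + s) <= fx - alpha * np.linalg.norm(s) ** 3

    # Step 1: the quadratic model (3); here we treat it as solvable only
    # when H(x) is positive definite (a simplification of Step 1.1).
    try:
        np.linalg.cholesky(Hx)
        s = np.linalg.solve(Hx, -gx)          # Newton step, rho = 0
        if accepted(s):
            return x + s, 0.0
    except np.linalg.LinAlgError:
        pass
    # Step 2: cubic regularized model (4) with controlled increase of rho.
    rho_bar = rho_min
    while True:
        rho = 0.5 * (tau1 + tau2) * rho_bar   # any value in [tau1*rho_bar, tau2*rho_bar]
        s = solve_cubic(gx, Hx, rho)          # minimizer of g's + s'Hs/2 + rho*||s||^3
        if accepted(s):
            return x + s, rho
        rho_bar = rho
```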
Note that, at Step 1.1, (3) is solvable if and only if $H_k$ is positive definite, or $H_k$ is positive semidefinite and $g(x^k)$ belongs to the range space of $H_k$. Note also that, at Step 2.1, we solve the cubic regularization problem (4) with a regularization parameter $\rho$ that is a priori unknown and may take any value between $\tau_1 \bar\rho$ and $\tau_2 \bar\rho$. This may be done using a root-finding process that aims to compute the quadratic regularization parameter that corresponds to a cubic regularization parameter between the given bounds (see [8] for details). Recall that, as shown in [17, Thm. 3.1], the set of solutions of cubic regularized problems, varying $\rho$, coincides with the set of solutions of quadratic regularized problems, varying $\sigma$. At Steps 1.3 and 2.2, a magical step is being considered, and its definition may depend on characteristics of $\Omega$. We have in mind the case in which $\Omega$ is a closed and convex set onto which projecting is an affordable task. In this case, if (1) does not hold with $s_{\rm trial} = s^{k,j}$, we may define $w$ as the projection of $x^k + s^{k,j}$ onto $\Omega$. This trick has proved to be very useful in practice; a similar device is used in [31]. The number of magical steps per iteration is limited to $J$.
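For instance, when $\Omega$ is a box $\{x \mid \ell \le x \le u\}$, the projection that defines the magical step is a componentwise clipping. A minimal sketch (the function name is ours):

```python
import numpy as np

def magical_step(x, s, lower, upper):
    """Candidate w of Steps 1.3 and 2.2 for the box Omega = {z : lower <= z <= upper}:
    the Euclidean projection of the non-interior trial point x + s onto Omega,
    which for a box reduces to componentwise clipping."""
    return np.clip(x + s, lower, upper)
```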
Algorithm 2.1 has been described in such a way that the only reason for stopping is finding an iterate on the boundary of $\Omega$. In any other case, the algorithm generates an infinite sequence. Of course, in practice, stopping criteria are necessary (and they will be defined later) but, in theory, we are interested in the behavior of the potentially infinite sequence of iterates.
When the algorithm arrives at Step 3, a point $x^k + s^{k,j}$ such that (1) and (2) hold with $s_{\rm trial} = s^{k,j}$ has been computed. If, in addition, $\rho_{k,j} \le M$, then $x^k + s^{k,j}$ is accepted as the new iterate. The obvious question is why we do not accept the trial point $x^k + s^{k,j}$ when (5) does not hold, despite its being interior and satisfying the sufficient descent condition. The reason is that, since $\rho_{k,j} > M$ could be very large, $s^{k,j}$ could be unacceptably small and, thus, sufficient descent might not mean satisfactory descent. This situation may only happen when $j > 0$ and $x^k + s^{k,j-1} \notin \mathrm{Int}(\Omega)$, which means that the previous trial point at the present iteration was rejected, without testing its functional value, because it was not interior. Deciding whether to accept $x^k + s^{k,j}$ as the new iterate involves computing an additional point on the boundary of $\Omega$, and this task is performed at Step 4.
Recall that, when the algorithm executes Step 4, we are in a situation in which $x = x^k + s^{k,j}$ is interior and satisfies (2), whereas $y = x^k + s^{k,j-1}$, the previously computed trial point, is not interior. Moreover, both $x$ and $y$ come from solving cubic regularization subproblems with parameters $\rho_x$ and $\rho_y$ ($\rho_y$ may be null), respectively, where $\rho_x \ge 2\rho_y \ge 0$. This means that both $x$ and $y$ come from solving quadratic regularization problems with regularization parameters $\sigma_x = \sigma_2 = 3\|s^{k,j}\|\rho_{k,j}$ and $\sigma_y = \sigma_1 = 3\|s^{k,j-1}\|\rho_{k,j-1}$, respectively, with $\sigma_x > \sigma_y \ge 0$. At Step 4.1, it is assumed that the root-finding process of computing a solution to a quadratic regularized subproblem with $\sigma \in [\sigma_y, \sigma_x)$ that lies on the boundary of $\Omega$ is an affordable task. The proposition below shows that the $\sigma$ and $w$ that must be computed at Step 4.1 are well defined.
Proposition 2.1. Assume that $\Omega \subset \mathbb{R}^n$, $\varphi : [a, b] \to \mathbb{R}^n$ is continuous, $\varphi(a) \in \mathrm{Int}(\Omega)$, and $\varphi(b) \notin \Omega$. Then, there exists $\bar t \in [a, b]$ such that $\varphi(\bar t)$ belongs to the boundary of $\Omega$.

Proof. Define $\bar t \in [a, b]$ as the supremum of the values of $t \in [a, b]$ such that $\varphi(t) \in \Omega$.

If $\bar t = b$, there exists a sequence $t_k \to b$ such that $\varphi(t_k) \in \Omega$ and, by continuity of $\varphi$, $\varphi(t_k) \to \varphi(b)$. Hence $\varphi(b)$ belongs to the closure of $\Omega$ and, since $\varphi(b) \notin \Omega$, the point $\varphi(\bar t) = \varphi(b)$ is on the boundary of $\Omega$ and we are done.

Consider the case in which $\bar t < b$. On the one hand, by the definition of supremum, there exists a sequence $t_k \to \bar t$ with $t_k \le \bar t$ and $\varphi(t_k) \in \Omega$. On the other hand, for all $t \in (\bar t, b]$, $\varphi(t)$ does not belong to $\Omega$; thus, taking $t_m \to \bar t$ with $t_m > \bar t$ gives points $\varphi(t_m) \notin \Omega$ arbitrarily close to $\varphi(\bar t)$. Therefore, by continuity of $\varphi$, every neighborhood of $\varphi(\bar t)$ contains points that belong to $\Omega$ and points that do not belong to $\Omega$. This implies that $\varphi(\bar t)$ is on the boundary of $\Omega$.
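Computationally, a point as in Proposition 2.1 can be approximated by bisection whenever a membership oracle for $\Omega$ is available. The sketch below is ours; it returns a parameter whose image is within a prescribed tolerance of the boundary, and it assumes that $\{t \mid \varphi(t) \in \Omega\}$ is an interval, which holds, e.g., when $\Omega$ is convex and $\varphi$ parametrizes a line segment.

```python
def boundary_parameter(phi, in_omega, a, b, tol=1e-12):
    """Bisection in the spirit of Proposition 2.1: phi is continuous on [a, b],
    phi(a) lies in Int(Omega), phi(b) lies outside Omega, and in_omega is a
    membership oracle for Omega.  Returns t with phi(t) near the boundary."""
    lo, hi = float(a), float(b)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if in_omega(phi(mid)):
            lo = mid          # still inside: the boundary is to the right
        else:
            hi = mid          # outside: the boundary is to the left
    return lo
```

For the segment $\varphi(t) = x^k + t\,s^{k,j}$ with $a = 0$ and $b = 1$, this yields (approximately) the largest feasible step toward a non-interior trial point.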
It must be noted that, by assuming that it is an affordable task to find a trial point $w$ that is at the same time a solution to a quadratic regularization subproblem and a point on the boundary of $\Omega$, a single functional evaluation is performed (at Step 4.2) to decide whether to reach the boundary or to remain in the interior of $\Omega$. Having computed $w$, the algorithm checks whether $f(w) < f(x^k)$ or not. If it is, the algorithm defines $x^{k+1} = w$ and stops. If $f(w) \ge f(x^k)$, then (2) does not hold with $s_{\rm trial} = w - x^k$. In this case, the algorithm accepts $x^k + s^{k,j}$ as the new iterate. This is because the fact that $s_{\rm trial} = w - x^k$ does not satisfy (2) reveals that the cubic regularization parameter $\rho_{k,j}$ associated with $x^k + s^{k,j}$ is not unnecessarily big and, as a consequence, that $x^k + s^{k,j}$ is acceptable as the new iterate.
Assumption A1. There exists $L > 0$ such that, for all $x^k$ computed by Algorithm 2.1 and all $s_{\rm trial}$ considered in (2), we have

(6) $f(x^k + s_{\rm trial}) \le f(x^k) + g(x^k)^T s_{\rm trial} + \frac{1}{2} s_{\rm trial}^T H_k s_{\rm trial} + L \|s_{\rm trial}\|^3$.

Lemma 2.1. Suppose that Assumption A1 holds. If $\rho_{k,j} \ge L + \alpha$ then (2) holds with $s_{\rm trial} = s^{k,j}$.

Proof. If $\rho_{k,j} \ge L + \alpha$ then, by Assumption A1, we have that

$f(x^k + s^{k,j}) \le f(x^k) + g(x^k)^T s^{k,j} + \frac{1}{2} (s^{k,j})^T H_k s^{k,j} + L \|s^{k,j}\|^3$
$= f(x^k) + g(x^k)^T s^{k,j} + \frac{1}{2} (s^{k,j})^T H_k s^{k,j} + (L + \alpha) \|s^{k,j}\|^3 - \alpha \|s^{k,j}\|^3$
$\le f(x^k) + g(x^k)^T s^{k,j} + \frac{1}{2} (s^{k,j})^T H_k s^{k,j} + \rho_{k,j} \|s^{k,j}\|^3 - \alpha \|s^{k,j}\|^3$.

Since $s^{k,j}$ being a solution to (4) with $\rho = \rho_{k,j}$ implies that

$g(x^k)^T s^{k,j} + \frac{1}{2} (s^{k,j})^T H_k s^{k,j} + \rho_{k,j} \|s^{k,j}\|^3 \le 0$,

(2) follows from the last inequality.
Lemma 2.2. Suppose that Assumption A1 holds. Then, the sequence $\{x^k\}$ generated by Algorithm 2.1 is well defined. Moreover, if the algorithm does not stop at a boundary point at iteration $k$ (in which case $\rho_k$ is undefined), we have that

(7) $\rho_k \le \max\{M, \tau_2 (L + \alpha)\}$.

Proof. Proving that the sequence $\{x^k\}$ is well defined consists in showing that, given $x^k \in \mathrm{Int}(\Omega)$, the point $x^{k+1}$ is computed in finite time. If $x^{k+1}$ is a solution to (3) computed at Step 1, or it is a point on the boundary of $\Omega$ computed at Steps 1.3, 2.2, or 4, we are done, since those calculations involve a finite number of operations by definition. We now assume that $x^{k+1} = x^k + s^{k,j}$ and $s^{k,j}$ is a solution to (4) with $\rho = \rho_{k,j}$ (computed at Step 3). Since $x^k$ is interior, for $\rho_{k,j}$ large enough we have that $x^k + s^{k,j}$ is interior too, meaning that (1) holds with $s_{\rm trial} = s^{k,j}$. Moreover, by Lemma 2.1, if $\rho_{k,j} \ge L + \alpha$ then (2) holds with $s_{\rm trial} = s^{k,j}$. Thus, since the updating rule of $\rho_{k,j}$ implies that $\rho_{k,j} \to \infty$ when $j \to \infty$, we have that $x^k + s^{k,j}$ satisfying (1) and (2) with $s_{\rm trial} = s^{k,j}$ is found in finite time.

Let us now prove the boundedness of $\rho_k$. If $\rho_k = \rho_{k,j}$ is defined at Step 3 then we have that $\rho_k \le M$. Assume now that $\rho_k = \rho_{k,j}$ is defined at Step 4.2. By the definition of the algorithm, we have that $\rho_{k,j} \in [\tau_1 \rho_{k,j-1}, \tau_2 \rho_{k,j-1}]$. Moreover, the point $w$ on the boundary of $\Omega$ was rejected. This point is of the form $w = x^k + s_w$ with $s_w = -(H_k + \sigma I)^{-1} g(x^k)$ a solution to the quadratic regularized subproblem

Minimize $g(x^k)^T s + \frac{1}{2} s^T H_k s + \frac{\sigma}{2} \|s\|^2$

with $\sigma \in [\sigma_1, \sigma_2)$, $\sigma_1 = 3\|s^{k,j-1}\|\rho_{k,j-1}$, and $\sigma_2 = 3\|s^{k,j}\|\rho_{k,j}$. This means (see [17, Thm. 3.1]) that $s_w$ is a solution to (4) with $\rho = \rho_w = \sigma / (3\|s_w\|)$ satisfying $\rho_{k,j-1} \le \rho_w \le \rho_{k,j}$. The point $w$ was rejected because $f(w) \ge f(x^k)$, which means that (2) with $s_{\rm trial} = s_w$ does not hold. Therefore, by Lemma 2.1, we must have $\rho_w < L + \alpha$. Thus, $\rho_{k,j-1} < L + \alpha$ and, in consequence, $\rho_{k,j} \le \tau_2 \rho_{k,j-1} < \tau_2 (L + \alpha)$, as we wanted to prove.
Lemma 2.3. If $s^k = 0$ then $g(x^k) = 0$ and $H_k$ is positive semidefinite.

Proof. Since

$\nabla \left( g(x^k)^T s + \frac{1}{2} s^T H_k s \right)\big|_{s=0} = g(x^k)$ and $\nabla^2 \left( g(x^k)^T s + \frac{1}{2} s^T H_k s \right)\big|_{s=0} = H_k$,

if $s^k = 0$ solves (3) then we have that $g(x^k) = 0$ and $H_k$ is positive semidefinite.

The gradient of the objective function that defines (4) is $g(x^k) + H_k s + 3\rho \|s\| s$. Therefore, if $s^k = 0$ is a solution to (4) then we have that $g(x^k) = 0$. Moreover, for all $s \in \mathbb{R}^n$, we must have

$\frac{1}{2} s^T H_k s + \rho \|s\|^3 \ge 0$,

since otherwise $s^k = 0$ would not be a solution to (4). Then, for all $s \ne 0$,

$\frac{1}{2} \frac{s^T H_k s}{\|s\|^2} + \rho \|s\| \ge 0$.

In particular, if $s \ne 0$ is an eigenvector associated with $\lambda_1(H_k)$, it turns out that

$\frac{1}{2} \lambda_1(H_k) + \rho \|s\| \ge 0$.

Taking limits as $s \to 0$, it follows that $\lambda_1(H_k) \ge 0$, as we wanted to prove.
Lemma 2.4. Suppose that Assumption A1 holds, that Algorithm 2.1 generates an infinite sequence $\{x^k\}_{k=0}^\infty$, and that $\{f(x^k)\}_{k=0}^\infty$ is bounded below. Then,

(8) $\lim_{k \to \infty} \|s^k\| = 0$

and

(9) $\lim_{k \to \infty} [-\lambda_1(H_k)]_+ = 0$.

Proof. Since $\{f(x^k)\}$ is bounded below, (8) follows from the fulfillment of the sufficient descent condition (2) with $s_{\rm trial} = s^k$ for all $k$.

Moreover, $s^k$ is a minimizer of $g(x^k)^T s + \frac{1}{2} s^T H_k s + \rho_k \|s\|^3$ and, by Lemma 2.3, $H_k$ is positive semidefinite if $s^k = 0$. If $s^k \ne 0$, the Hessian of the cubic function $g(x^k)^T s + \frac{1}{2} s^T H_k s + \rho_k \|s\|^3$ at $s = s^k$ must be positive semidefinite. This Hessian is

$H_k + 3\rho_k \left( \frac{s^k (s^k)^T}{\|s^k\|} + \|s^k\| I \right)$.

Thus,

$\lambda_1\!\left( H_k + 3\rho_k \left( \frac{s^k (s^k)^T}{\|s^k\|} + \|s^k\| I \right) \right) \ge 0$.

Since $s^k \to 0$ and, by Lemma 2.2, $\rho_k$ is bounded, the perturbation of $H_k$ above vanishes and we have that (9) holds.
Assumption A2. Assumption A1 holds and, for all $x^k$ and $s^k$ computed by Algorithm 2.1,

(10) $\|g(x^k + s^k)\| \le (L + 3\rho_k) \|s^k\|^2$.

A sufficient condition for the fulfillment of Assumption A2 comes from assuming that $g(x^k + s^k)$ is well represented by its linear approximation, namely,

(11) $\|g(x^k + s^k) - g(x^k) - H(x^k) s^k\| \le L \|s^k\|^2$.

In this case, Assumption A2 holds because the gradient of the subproblem at Step 2 vanishes at its solution. Moreover, a sufficient condition for the fulfillment of both Assumptions A1 and A2 is the fulfillment of a Lipschitz condition by the Hessian $H(x)$. It is interesting to note that Assumption A2 requires (10) to be verified only with respect to increments $s^k$ that satisfy the interiority and the sufficient descent conditions, whereas Assumption A1 requires (6) to hold at every trial increment $s_{\rm trial}$.
Lemma 2.5. Suppose that Assumption A2 holds. Then, for all $x^k$ and $s^k$ generated by Algorithm 2.1 such that $x^{k+1} = x^k + s^k \in \mathrm{Int}(\Omega)$, we have that

(12) $f(x^{k+1}) \le f(x^k) - \alpha \left( \dfrac{\|g(x^{k+1})\|}{L + 3\max\{M, \tau_2(L+\alpha)\}} \right)^{3/2}$.

Proof. By Assumption A2 and Lemma 2.2, we have that

(13) $\|g(x^{k+1})\| \le \left( L + 3\max\{M, \tau_2(L+\alpha)\} \right) \|s^k\|^2$.

Thus, (12) follows from (13) and the fact that (2) holds with $s_{\rm trial} = s^k$.
Theorem 2.1. Suppose that Assumption A2 holds and let $f_{\rm target} \in \mathbb{R}$, $\varepsilon_g > 0$, and $\varepsilon_h > 0$ be arbitrary. Let $\rho_{\max} = \max\{M, \tau_2(L+\alpha)\}$. Then, the number of iterations at which

$f(x^k) > f_{\rm target}$ and $\|g(x^k)\| > \varepsilon_g$

is, at most,

(14) $\left\lfloor \dfrac{f(x^0) - f_{\rm target}}{\alpha} \left( \dfrac{\varepsilon_g}{L + 3\rho_{\max}} \right)^{-3/2} \right\rfloor$.

Moreover, the number of iterations at which

$f(x^k) > f_{\rm target}$ and $\lambda_1(H_k) < -\varepsilon_h$

is, at most,

(15) $\left\lfloor \dfrac{f(x^0) - f_{\rm target}}{\alpha} \left( \dfrac{\varepsilon_h}{6\rho_{\max}} \right)^{-3} \right\rfloor$.

Finally, at each iteration $k$, the number of functional evaluations is bounded by

$J + \log_{\tau_1}\!\left( \dfrac{\rho_{\max}}{\rho_{\min}} \right) + 2$.

Proof. The maximum number of iterations (14) follows directly from Lemma 2.5.

The step $s^k$ is a minimizer of $g(x^k)^T s + \frac{1}{2} s^T H_k s + \rho_k \|s\|^3$, where, by Lemma 2.2, $\rho_k \le \rho_{\max}$. Thus, if $s^k \ne 0$, the Hessian of $g(x^k)^T s + \frac{1}{2} s^T H_k s + \rho_k \|s\|^3$ is positive semidefinite at $s = s^k$. By direct calculation, this Hessian is given by

$H_k + 3\rho_k \left( \dfrac{s^k (s^k)^T}{\|s^k\|} + \|s^k\| I \right)$.

Let $v^k$ be an eigenvector of $H_k$ associated with $\lambda_1(H_k)$ with $\|v^k\| = 1$. Then, by the positive semidefiniteness of $H_k + 3\rho_k \left[ \frac{s^k (s^k)^T}{\|s^k\|} + \|s^k\| I \right]$, we have that

$0 \le (v^k)^T \left( H_k + 3\rho_k \left[ \dfrac{s^k (s^k)^T}{\|s^k\|} + \|s^k\| I \right] \right) v^k \le \lambda_1(H_k) + 3\rho_k \left( \dfrac{((v^k)^T s^k)^2}{\|s^k\|} + \|s^k\| \right) \le \lambda_1(H_k) + 3\rho_k \left( \dfrac{\|s^k\|^2}{\|s^k\|} + \|s^k\| \right) \le \lambda_1(H_k) + 6\rho_{\max} \|s^k\|$.

Thus, $\|s^k\| \ge -\lambda_1(H_k)/(6\rho_{\max})$ or, equivalently, $-\|s^k\|^3 \le (\lambda_1(H_k)/(6\rho_{\max}))^3$. Therefore, since (2) with $s_{\rm trial} = s^k$ holds for all $k$, we have that

$f(x^{k+1}) \le f(x^k) - \alpha \|s^k\|^3 \le f(x^k) + \alpha \left( \lambda_1(H_k)/(6\rho_{\max}) \right)^3$,

from which (15) follows.

In order to establish the number of functional evaluations per iteration $k$, first note that, while the interiority condition (1) is not satisfied with $s_{\rm trial} = s^{k,j}$, on the one hand there is no need to check whether the descent condition (2) holds with $s_{\rm trial} = s^{k,j}$ and, on the other hand, simple decrease may be checked at no more than $J$ magical steps. This means that, at every iteration $k$, while (1) does not hold with $s_{\rm trial} = s^{k,j}$, at most $J$ functional evaluations are performed. Once (1) is satisfied, the limit on the number of functional evaluations follows from the boundedness of $\rho_k$ for all $k$, given by Lemma 2.2, the fact that $\rho_{k,0} = 0$ by definition, and the updating rule for $\rho_{k,j}$, which guarantees that (a) if (3) is solvable then $\rho_{k,j} \ge \tau_1^{\,j-1} \rho_{\min}$ for all $j \ge 1$ and (b) if (3) is not solvable then $\rho_{k,j} \ge \tau_1^{\,j} \rho_{\min}$ for all $j \ge 0$. This completes the proof.
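As a sanity check on how the tolerances enter the bounds, the snippet below evaluates (14) and (15) for illustrative parameter values (chosen by us, not taken from the paper), exhibiting the $O(\varepsilon_g^{-3/2})$ and $O(\varepsilon_h^{-3})$ dependence.

```python
import math

# Illustrative parameters (our choices, not values from the paper).
L, alpha, M, tau2 = 1.0, 0.1, 1e2, 3.0
rho_max = max(M, tau2 * (L + alpha))   # rho_max as defined in Theorem 2.1
gap = 10.0                             # f(x0) - f_target

for eps_g in (1e-2, 1e-4):
    bound_1st = math.floor(gap / alpha * (eps_g / (L + 3 * rho_max)) ** (-1.5))
    print(f"eps_g = {eps_g:g}: at most {bound_1st} first-order iterations")
for eps_h in (1e-2, 1e-4):
    bound_2nd = math.floor(gap / alpha * (eps_h / (6 * rho_max)) ** (-3))
    print(f"eps_h = {eps_h:g}: at most {bound_2nd} second-order iterations")
```

Dividing $\varepsilon_g$ by 100 multiplies the first bound by $100^{3/2} = 1000$, as the worst-case theory predicts.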
Theorem 2.2. Suppose that Assumption A2 holds and that the sequence $\{x^k\}$ generated by Algorithm 2.1 is bounded. Then, either the sequence stops at some boundary point $x^{\hat k}$ such that $f(x^{\hat k}) < f(x^{\hat k - 1}) < \cdots < f(x^0)$, or an infinite sequence is generated such that

$\lim_{k \to \infty} \|g(x^k)\| = 0$ and $\lim_{k \to \infty} [-\lambda_1(H_k)]_+ = 0$.

Proof. If the sequence stops at a boundary point $x^{\hat k}$ then $f(x^{\hat k}) < f(x^{\hat k - 1}) < \cdots < f(x^0)$ follows from the definition of the algorithm. Assume that the sequence does not stop at a boundary point. Since the sequence is bounded, by continuity of $f$, there exists $f_{\rm bound} \in \mathbb{R}$ such that $f(x^k) > f_{\rm bound}$ for all $k$. Define $f_{\rm target} = f_{\rm bound}$ and let $\varepsilon > 0$ be arbitrary. By Theorem 2.1, taking $\varepsilon_g = \varepsilon$, we see that there exists $k_0$ such that, for all $k \ge k_0$, $\|g(x^k)\| \le \varepsilon$. The fact that $\lim_{k\to\infty} [-\lambda_1(H_k)]_+ = 0$ was proved in Lemma 2.4. This completes the proof.
3. High-order algorithms for constrained optimization. Consider the problem

(16) Minimize $f(x)$ subject to $x \in D$,

where $D \subset \mathbb{R}^n$ represents an arbitrary set. In this section, we introduce a class of methods for solving (16). The algorithms introduced in this section can be used in connection with the algorithm introduced in Section 2 in different ways. Therefore, this class of algorithms may be seen as a family of independent procedures for solving the main problem (16) or as auxiliary devices for continuing Algorithm 2.1 when "a face" of $D$ should be abandoned.
Algorithm 3.1 below is a high-order algorithm in which each iteration is defined by the approximate minimization of the $p$th Taylor approximation of the function $f$ around the iterate $x^k$, along the lines of [6]. If $f : \mathbb{R}^n \to \mathbb{R}$ has continuous derivatives up to order $p \in \{1, 2, 3, \ldots\}$, the Taylor polynomial of order $p$ will be written in the form

(17) $T_p(x, s) = \displaystyle\sum_{j=1}^{p} P_j(x, s)$,

where $P_j(x, s)$ is a homogeneous polynomial of degree $j$ given by

(18) $P_j(x, s) = \dfrac{1}{j!} \left( s_1 \dfrac{\partial}{\partial x_1} + \cdots + s_n \dfrac{\partial}{\partial x_n} \right)^j f(x)$.

For example, $T_1(x, s) = g(x)^T s$ and $T_2(x, s) = g(x)^T s + \frac{1}{2} s^T H(x) s$.
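For $p = 2$, the model used in the subproblems can be evaluated directly. A small sketch (function names are ours) of $T_2$ and of the regularized objective that appears in (22) below:

```python
import numpy as np

def taylor2(g, H, s):
    """T_2(x, s) = g(x)^T s + s^T H(x) s / 2, cf. (17)-(18) with p = 2."""
    return g @ s + 0.5 * s @ (H @ s)

def regularized_model2(g, H, s, rho):
    """T_2(x, s) + rho * ||s||^3, the p = 2 objective of subproblem (22)."""
    return taylor2(g, H, s) + rho * np.linalg.norm(s) ** 3
```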
Algorithm 3.1. Assume that $p \in \{1, 2, 3, \ldots\}$, $\alpha > 0$, $\rho_{\min} > 0$, $\tau_2 \ge \tau_1 > 1$, $\theta > 0$, and $x^0 \in D$ are given. Initialize $k \leftarrow 0$.

Step 1. Set $\rho \leftarrow 0$.
Step 2. Compute $s_{\rm trial} \in \mathbb{R}^n$ such that

(19) $x^k + s_{\rm trial} \in D$

and

(20) $T_p(x^k, s_{\rm trial}) + \rho \|s_{\rm trial}\|^{p+1} \le 0$.

Step 3. Test the condition

(21) $f(x^k + s_{\rm trial}) \le f(x^k) - \alpha \|s_{\rm trial}\|^{p+1}$.

If (21) holds, define $s^k = s_{\rm trial}$, $\rho_k = \rho$, and $x^{k+1} = x^k + s^k$, set $k \leftarrow k + 1$, and go to Step 1. Otherwise, update $\rho \leftarrow \max\{\rho_{\min}, \tau \rho\}$ with $\tau \in [\tau_1, \tau_2]$, and go to Step 2.
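The loop structure of Algorithm 3.1 is simple once the subproblem solver is abstracted away. The following Python sketch is ours; `solve_subproblem(x, rho)` stands for any oracle returning a step that satisfies (19) and (20) (in practice, an approximate solution of the subproblem (22) below), and the stopping criterion is left out, as in the paper.

```python
import numpy as np

def algorithm_3_1(f, solve_subproblem, x0, p=2, alpha=1e-8,
                  rho_min=1e-8, tau1=2.0, tau2=3.0, max_iter=100):
    """Sketch of Algorithm 3.1.  solve_subproblem(x, rho) must return s with
    x + s in D and T_p(x, s) + rho * ||s||^(p+1) <= 0, i.e. (19)-(20)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        rho = 0.0                                  # Step 1
        while True:
            s = solve_subproblem(x, rho)           # Step 2
            if f(x + s) <= f(x) - alpha * np.linalg.norm(s) ** (p + 1):
                x = x + s                          # Step 3: (21) holds, accept
                break
            tau = 0.5 * (tau1 + tau2)              # any tau in [tau1, tau2]
            rho = max(rho_min, tau * rho)          # increase regularization
    return x
```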
The trial increment $s_{\rm trial}$ is intended to be an approximate solution to the subproblem

(22) Minimize $T_p(x^k, s) + \rho \|s\|^{p+1}$ subject to $x^k + s \in D$.

Condition (20) is the minimal condition that should be imposed on the approximate solution to (22) in order to obtain meaningful results, although an obvious choice that satisfies (19), (20), and (21) is $s_{\rm trial} = 0$, which is not useful at all and will be ruled out later.
Lemma 3.1. Assume that the sequence $\{x^k\}$ is generated by Algorithm 3.1. If $\{f(x^k)\}$ is bounded below then

$\lim_{k \to \infty} \|s^k\| = 0$.

Proof. The proof follows straightforwardly from the hypotheses of the lemma and (21).
The following assumption coincides with Assumption A1 in the case $p = 2$.

Assumption A3. There exists $L > 0$ such that, for all $x^k$ computed by Algorithm 3.1 and all $s_{\rm trial}$ considered in (21), we have

(23) $f(x^k + s_{\rm trial}) \le f(x^k) + T_p(x^k, s_{\rm trial}) + L \|s_{\rm trial}\|^{p+1}$.

Lemma 3.2. Suppose that Assumption A3 holds. If the regularization parameter $\rho$ in (20) satisfies $\rho \ge L + \alpha$ then a trial step $s_{\rm trial}$ that satisfies (20) also satisfies the sufficient descent condition (21).

Proof. By Assumption A3,

$f(x^k + s_{\rm trial}) \le f(x^k) + T_p(x^k, s_{\rm trial}) + L \|s_{\rm trial}\|^{p+1} = f(x^k) + T_p(x^k, s_{\rm trial}) + \rho \|s_{\rm trial}\|^{p+1} - \rho \|s_{\rm trial}\|^{p+1} + L \|s_{\rm trial}\|^{p+1}$.

Then, by (20),

$f(x^k + s_{\rm trial}) \le f(x^k) - \rho \|s_{\rm trial}\|^{p+1} + L \|s_{\rm trial}\|^{p+1}$.

Therefore, if $\rho \ge L + \alpha$, (21) holds.
Assumption A4. Assumption A3 holds and, for all $x^k$ and $s^k$ computed by Algorithm 3.1, we have that

(24) $\|g(x^k + s^k) - \nabla_s T_p(x^k, s^k)\| \le L \|s^k\|^p$.

Assumptions A3 and A4 are satisfied if the $p$th derivatives of $f$ satisfy a Lipschitz condition (see [6]). As in the case of Assumptions A1 and A2, observe that Assumption A3 must hold for every trial increment $s_{\rm trial}$, whereas Assumption A4 must hold only at the accepted increments $s^k$.
Assumption A5. The set $D$ is defined by

(25) $D = \{x \in \mathbb{R}^n \mid h_i(x) \le 0 \text{ for all } i = 1, \ldots, q\}$,

where the functions $h_i$ are continuously differentiable. Moreover, every feasible point of (25) satisfies a constraint qualification.

Assumption A5 implies that, for any function $\varphi : \mathbb{R}^n \to \mathbb{R}$, if $z$ is a minimizer of $\varphi$ subject to $z \in D$, then the associated KKT conditions hold at $z$. For the sake of simplicity, and without loss of generality, we formulated the definition of $D$ only in terms of inequality constraints. Implicitly, we assume that if equality constraints are present, they are expressed as pairs of inequalities.
For all $x \in D$, we define

(26) $L(D, x) = \{z \in \mathbb{R}^n \mid h_i(x) + h_i'(x)(z - x) \le 0 \text{ for all } i = 1, \ldots, q\}$.

We say that $L(D, x)$ is the linear approximation of $D$ around $x$. Given $x \in D$ and a function $\varphi$, we denote $\nabla_D \varphi(x) = P_{L(D,x)}(x - \nabla \varphi(x)) - x$. Direct calculations show that $x$ satisfies the KKT conditions of the problem of minimizing $\varphi$ on $D$ if, and only if, $\nabla_D \varphi(x) = 0$. If there exists $x^k \to x^*$ such that $\nabla_D \varphi(x^k) \to 0$, we say that $x^*$ satisfies the Approximate Gradient Projection (AGP) sequential optimality condition [2, 29].
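In the particular case of bound constraints, the $h_i$ are affine, so $L(D, x) = D$ and $\nabla_D \varphi(x)$ is computable by a single clipping. A minimal sketch (ours) of this stationarity residual; for general $h_i$, evaluating the projection onto $L(D, x)$ would instead require solving a quadratic program.

```python
import numpy as np

def nabla_D(grad_phi_x, x, lower, upper):
    """Stationarity residual nabla_D phi(x) = P_{L(D,x)}(x - grad phi(x)) - x
    for bound constraints lower <= x <= upper, where L(D, x) = D and the
    Euclidean projection is componentwise clipping.  A small norm of the
    returned vector certifies an approximate KKT point."""
    return np.clip(x - grad_phi_x, lower, upper) - x
```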
Assumption A6. There exists $\theta > 0$ such that, for all $k \in \mathbb{N}$, the approximate solutions $s^k$ to the subproblem (22) satisfy

(27) $\left\| \nabla_D \left( T_p(x^k, x - x^k) + \rho \|x - x^k\|^{p+1} \right) \big|_{x = x^k + s^k} \right\| \le \theta \|s^k\|^p$.

Assumption A6 states that the accepted increment $s^k$ satisfies, approximately, an AGP optimality condition. If the constraints satisfy some constraint qualification, every minimizer of (22) satisfies the AGP condition and, consequently, also fulfills (27). Therefore, Assumption A6 states the degree of accuracy with which one wishes to solve the subproblems (and this is the assumption that rules out the possibility of taking $s^k = s_{\rm trial} = 0$). Note that the degree of accuracy (27) is required not for all the subproblems but only for the ones that, ultimately, define algorithmic progress.
Lemma 3.3. Suppose that Assumptions A4, A5, and A6 hold and that the sequence $\{x^k\}$ is generated by Algorithm 3.1. Then, at each iteration $k$, we have that

(28) $\left\| \nabla_D f(x^k + s^k) \right\| \le \left( L + \tau_2 (L + \alpha)(p + 1) + \theta \right) \|s^k\|^p$.

Proof. By the definition of $\nabla_D$, we have that

(29) $\left\| \nabla_D f(x^k + s^k) - \nabla_D T_p(x^k, x - x^k)\big|_{x = x^k + s^k} \right\|$
$= \left\| P_{L(D, x^k+s^k)}\!\left( (x^k + s^k) - g(x^k + s^k) \right) - P_{L(D, x^k+s^k)}\!\left( (x^k + s^k) - \nabla T_p(x^k, x - x^k)\big|_{x = x^k + s^k} \right) \right\|$
$\le \left\| \left( (x^k + s^k) - g(x^k + s^k) \right) - \left( (x^k + s^k) - \nabla T_p(x^k, x - x^k)\big|_{x = x^k + s^k} \right) \right\|$
$= \left\| g(x^k + s^k) - \nabla T_p(x^k, x - x^k)\big|_{x = x^k + s^k} \right\| \le L \|s^k\|^p$,

where the first inequality follows from the nonexpansiveness of projections and the last inequality follows from Assumption A4. This means that

$\|\nabla_D f(x^k + s^k)\| \le L \|s^k\|^p + \left\| \nabla_D T_p(x^k, x - x^k)\big|_{x = x^k + s^k} \right\|$
$\le (L + \theta)\|s^k\|^p + \rho \left\| \nabla_D \left( \|x - x^k\|^{p+1} \right)\big|_{x = x^k + s^k} \right\|$
$= (L + \theta)\|s^k\|^p + \rho \left\| P_{L(D, x^k+s^k)}\!\left( (x^k + s^k) - \nabla\left( \|x - x^k\|^{p+1} \right)\big|_{x = x^k + s^k} \right) - (x^k + s^k) \right\|$
$\le (L + \theta)\|s^k\|^p + \rho \left\| (x^k + s^k) - \nabla\left( \|x - x^k\|^{p+1} \right)\big|_{x = x^k + s^k} - (x^k + s^k) \right\|$
$= (L + \theta)\|s^k\|^p + \rho \left\| \nabla\left( \|x - x^k\|^{p+1} \right)\big|_{x = x^k + s^k} \right\| = \left( L + \theta + \rho(p+1) \right)\|s^k\|^p$,

where the first inequality follows from (29), the second inequality follows from Assumption A6, and the third inequality follows from the fact that $x^k + s^k$ belongs to $L(D, x^k + s^k)$ and, therefore, for any $z \in \mathbb{R}^n$, $P_{L(D, x^k+s^k)}(z)$ is closer to $x^k + s^k$ than $z$ itself. Finally, by Lemma 3.2, we have that $\rho \le \tau_2(L + \alpha)$, which implies the desired result.
Lemma 3.4. Suppose that Assumptions A4, A5, and A6 hold and that the sequence $\{x^k\}$ is generated by Algorithm 3.1. Then, at each iteration $k$, we have that

(30) $f(x^{k+1}) \le f(x^k) - \alpha \left( \dfrac{\|\nabla_D f(x^{k+1})\|}{L + \tau_2(L+\alpha)(p+1) + \theta} \right)^{(p+1)/p}$.

Proof. The result follows straightforwardly from Lemma 3.3 and (21).

Theorem 3.1. Suppose that Assumptions A4, A5, and A6 hold and that the sequence $\{x^k\}$ is generated by Algorithm 3.1. Let $f_{\rm target} \in \mathbb{R}$ and $\varepsilon_g > 0$ be arbitrary. Then, the number of iterations $k$ such that

$f(x^k) > f_{\rm target}$ and $\|\nabla_D f(x^{k+1})\| \ge \varepsilon_g$

is not greater than

(31) $\left\lfloor \dfrac{f(x^0) - f_{\rm target}}{\alpha} \left( \dfrac{\varepsilon_g}{L + \tau_2(L+\alpha)(p+1) + \theta} \right)^{-(p+1)/p} \right\rfloor$

and the number of functional evaluations per iteration is bounded above by

$\log_{\tau_1}\!\left( \dfrac{\tau_2(L+\alpha)}{\rho_{\min}} \right) + 1$.

Finally, if $\{f(x^k)\}$ is bounded below,

(32) $\lim_{k \to \infty} \|\nabla_D f(x^{k+1})\| = 0$.

Proof. The maximum number of iterations (31) follows from Lemma 3.4. The bound on the number of functional evaluations per iteration follows from the updating rule for $\rho$ and Lemma 3.2, while (32) follows from Lemma 3.4 and the boundedness below of $\{f(x^k)\}$.
Assumption A7. For all $k \in \mathbb{N}$, $s^k$ is a global minimizer of $T_p(x^k, s) + \rho_k \|s\|^{p+1}$ subject to $x^k + s \in D$.

If the constraints $h(x) \le 0$ are simple enough, Assumption A7 implies Assumption A6.

Theorem 3.2. Suppose that Assumptions A4, A5, A6, and A7 hold and that the sequence $\{x^k\}$ is generated by Algorithm 3.1. Let $x^*$ be a limit point of $\{x^k\}$. Then, there exists $\rho_* \in [0, \tau_2(L+\alpha)]$ such that $s = 0$ is a global minimizer of $T_p(x^*, s) + \rho_* \|s\|^{p+1}$ subject to $x^* + s \in D$.

Proof. By Lemma 3.2, $\rho_k \in [0, \tau_2(L+\alpha)]$ for all $k \in \mathbb{N}$. Therefore, there exist $\rho_* \in [0, \tau_2(L+\alpha)]$ and an infinite sequence of indices $K$ such that

$\lim_{k \in K} \rho_k = \rho_*$ and $\lim_{k \in K} x^k = x^*$.

Let $s \in \mathbb{R}^n$ such that $x^k + s \in D$ be arbitrary. By Assumption A7, we have that

$T_p(x^k, s^k) + \rho_k \|s^k\|^{p+1} \le T_p(x^k, s) + \rho_k \|s\|^{p+1}$.

Taking limits in this inequality for $k \in K$ and using the continuity of the derivatives of $f$ up to order $p$ and the fact that, by Lemma 3.1, $s^k \to 0$ and $\rho_k \to \rho_*$, we obtain

$T_p(x^*, 0) + \rho_* \|0\|^{p+1} \le T_p(x^*, s) + \rho_* \|s\|^{p+1}$.

Since $s$ satisfying $x^* + s \in D$ is arbitrary, we obtain the desired result.
Recall that, if a point $x^*$ is an unconstrained local minimizer of a function $f$, then such a point is $p$-order stationary in the following sense. A point $x \in \mathbb{R}^n$ is said to be $q$-order stationary (with $1 \le q \le p$) if it is $(q-1)$-order stationary and, for all $v \in \mathbb{R}^n$ such that $P_0(x, v) = \cdots = P_{q-1}(x, v) = 0$, one has that $P_q(x, v) \ge 0$. By convention, we say that every point $x \in \mathbb{R}^n$ is 0-order stationary and that $P_0(x, v) = 0$.

Corollary 3.1. Assume that $q = 0$ (that is, $D = \mathbb{R}^n$ and the problem is unconstrained). Under the hypotheses of Theorem 3.2, the limit point $x^*$ is $m$-order stationary for all $m \le p$.

Proof. By Theorem 3.2, for each $m \le p$, $s = 0$ is an $m$-order stationary point of the function $T_p(x^*, s) + \rho_* \|s\|^{p+1}$. But the conditions of $m$-order stationarity of this function at $s = 0$ are the same as those of $f$ at $x^*$, since all their derivatives up to order $p$ coincide.
An $m$-order stationarity condition for the local minimization of $\varphi(x)$ subject to $x \in D$ is a property that involves the derivatives of $\varphi$ up to order $m$, as well as properties of $D$, and that must be satisfied by any local minimizer of $\varphi$. Since, for all $m \le p$, the $m$-order partial derivatives of $f$ at $x^*$ coincide with those of $T_p(x^*, s) + \rho_* \|s\|^{p+1}$ at $s = 0$, we may extend Corollary 3.1 to the constrained optimization case as follows.

Corollary 3.2. Under the hypotheses of Theorem 3.2, the limit point $x^*$ is $m$-order stationary for all $m \le p$.
4. Linearly-constrained optimization. In this section, we consider the problem

(33) Minimize $f(x)$ subject to $x \in D$,

where $D \subset \mathbb{R}^n$ is a polytope defined by

$D = \{x \in \mathbb{R}^n \mid (a^i)^T x \le b_i \text{ for all } i = 1, \ldots, q\}$.