
The Online Newton Step Algorithm

By Lemma 3.1.1, h is concave if and only if ∇²h(x) ⪯ 0 for every x ∈ dom h = X. The latter holds if and only if, for every x ∈ X,

    α²e^{−αf(x)}∇f(x)∇f(x)^T ⪯ αe^{−αf(x)}∇²f(x)  ⟺  α∇f(x)∇f(x)^T ⪯ ∇²f(x).

Let us now look at some examples of exp-concave functions for the sake of concreteness. First, let us show that the functions from the sequential investment problem defined in Section 2.2.4 are all exp-concave.

Proposition 6.5.3. Let r ∈ R^d_{++}, define the convex set X := {x ∈ R^d_+ : 1^T x > 0}, and define the function f(x) := −ln(r^T x) + δ(x | X) for every x ∈ R^d. Then f is 1-exp-concave on X.

Proof. Since x ∈ R^d ↦ r^T x is twice continuously differentiable on R^d, since α ∈ R_{++} ↦ ln(α) is twice continuously differentiable on R_{++}, and since r^T x > 0 for every x ∈ X, we conclude that f is twice continuously differentiable on X. Moreover, note that

    ∇f(x) = −(1/(r^T x)) r   and   ∇²f(x) = (1/(r^T x)²) rr^T   for every x ∈ X.

Therefore, for every x ∈ X,

    ∇²f(x) − ∇f(x)∇f(x)^T = (1/(r^T x)²) rr^T − (1/(r^T x)²) rr^T = 0.

Thus, by Lemma 6.5.2 we conclude that f is 1-exp-concave on X.
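To make the criterion concrete, here is a small numerical spot-check (a sketch, not part of the original development) of the condition α∇f(x)∇f(x)^T ⪯ ∇²f(x) with α = 1 for this f; the vectors r and x below are arbitrary illustrative choices.

import numpy as np

# Spot-check of the 1-exp-concavity criterion for f(x) = -ln(r^T x).
# The values of r and x are arbitrary; any r > 0 and x >= 0 with r^T x > 0 work.
rng = np.random.default_rng(0)
d = 4
r = rng.uniform(0.5, 2.0, size=d)          # r in R^d_{++}
x = rng.uniform(0.1, 1.0, size=d)          # a point of X with r^T x > 0
rx = r @ x
grad = -r / rx                             # gradient of f at x
hess = np.outer(r, r) / rx**2              # Hessian of f at x
# Criterion from Lemma 6.5.2: Hessian - grad grad^T must be positive semidefinite;
# here it is exactly the zero matrix, as computed in the proof above.
assert np.allclose(hess - np.outer(grad, grad), 0.0)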

The next theorem, which we will use later as a tool to prove regret bounds, shows an inequality for exp-concave functions which is similar to the inequality for strongly convex functions given by Theorem 3.9.7. Namely, the latter theorem states that if a closed convex function f: R^d → (−∞,+∞] is α-strongly convex and f is subdifferentiable at x ∈ X, then

    f(y) ≥ f(x) + g^T(y − x) + (α/2)‖x − y‖₂²,   ∀y ∈ X, ∀g ∈ ∂f(x).

The inequality we prove in the next theorem for exp-concave functions is similar to the one above.

The main difference is that instead of the squared norm, the inequality from the next theorem uses a "local norm"15 based on the gradient of the function. Before jumping into the theorem, we need a simple result which we prove next.

Lemma 6.5.4. For every α ∈ [−1/4, 1/4], we have

    −ln(1 − α) ≥ α + (1/4)α².

Proof. For every α < 1 define

    f(α) := −ln(1 − α)   and   h(α) := α + (1/4)α².

Since f(0) = 0 = h(0) and since both f and h are differentiable on [−1/4, 1/4], to prove f(α) ≥ h(α) for every α ∈ [−1/4, 1/4] it suffices to prove f′(α) ≥ h′(α) for every α ∈ [0, 1/4] and f′(α) ≤ h′(α) for every α ∈ [−1/4, 0). Note that, for every α ∈ [−1/4, 1/4], since 1 − α > 0 we have

    f′(α) ≥ h′(α)  ⟺  1/(1 − α) ≥ 1 + (1/2)α  ⟺  2 ≥ (2 + α)(1 − α)  ⟺  2 ≥ 2 − α − α²  ⟺  α² + α ≥ 0.

Since α² + α ≥ 0 for every α ∈ [0, 1/4] and α² + α ≤ 0 for every α ∈ [−1/4, 0), we are done.

15. Note that it is not a norm since the matrix in the inequality is a rank-one matrix.
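As a quick sanity check of Lemma 6.5.4 (not part of the proof), the inequality can be verified numerically on a grid of the interval [−1/4, 1/4]:

import numpy as np

# Numerical sanity check of Lemma 6.5.4: -ln(1 - a) >= a + a^2/4 on [-1/4, 1/4].
a = np.linspace(-0.25, 0.25, 1001)
lhs = -np.log1p(-a)          # log1p(-a) = ln(1 - a)
rhs = a + 0.25 * a**2
assert np.all(lhs >= rhs)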

Theorem 6.5.5. Let X ⊆ R^d, let ‖·‖ be a norm on R^d, and let α ∈ R_{++}. Let f: R^d → (−∞,+∞] be a closed convex function which is α-exp-concave on X, ρ-Lipschitz continuous w.r.t. ‖·‖ on X, and differentiable on X. Moreover, suppose there is θ ∈ R_{++} such that sup_{x,y∈X} ‖x − y‖² ≤ θ. Finally, let β ∈ R_{++} be such that β ≤ (1/2) min{(4ρ√θ)^{−1}, α}. Then, for any x, y ∈ X we have

    f(y) ≥ f(x) + ∇f(x)^T(y − x) + (β/2)(x − y)^T∇f(x)∇f(x)^T(x − y).

Proof. Since 2β ≤ α, we have that f is 2β-exp-concave on X. Thus, h := e^{−2βf(·)} + δ(· | X) is a concave function. Let x, y ∈ X. Then, by the subgradient inequality,

    h(y) ≤ h(x) + ∇h(x)^T(y − x)
      ⟹ e^{−2βf(y)} ≤ e^{−2βf(x)} − 2βe^{−2βf(x)}∇f(x)^T(y − x)
      ⟹ e^{−2βf(y)} ≤ e^{−2βf(x)}(1 − 2β∇f(x)^T(y − x))
      ⟹ −2βf(y) ≤ −2βf(x) + ln(1 − 2β∇f(x)^T(y − x))
      ⟹ f(y) ≥ f(x) − (1/(2β)) ln(1 − 2β∇f(x)^T(y − x)).      (6.21)

By Theorems 3.5.5 and 3.8.4, we have ‖∇f(x)‖ ≤ ρ. Hence,

    |2β∇f(x)^T(y − x)| ≤ 2β‖∇f(x)‖‖x − y‖ ≤ 2βρ√θ ≤ 1/4.

Thus, we can use Lemma 6.5.4 on (6.21), which yields

    f(y) ≥ f(x) − (1/(2β)) ln(1 − 2β∇f(x)^T(y − x))
         ≥ f(x) + (1/(2β))(2β∇f(x)^T(y − x) + (1/4)(2β∇f(x)^T(y − x))²)
         = f(x) + ∇f(x)^T(y − x) + (β/2)(∇f(x)^T(y − x))²,

which is the desired inequality since (∇f(x)^T(y − x))² = (x − y)^T∇f(x)∇f(x)^T(x − y).
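For concreteness, the inequality of Theorem 6.5.5 can be spot-checked numerically for the 1-exp-concave loss f(x) = −ln(r^T x) on the probability simplex. The sketch below is only illustrative; the vector r, the Lipschitz and diameter bounds rho and theta, and the sample size are arbitrary assumed values.

import numpy as np

# Numerical spot-check of Theorem 6.5.5 for f(x) = -ln(r^T x) on the simplex.
rng = np.random.default_rng(3)
d = 3
r = rng.uniform(0.5, 1.5, size=d)
f = lambda x: -np.log(r @ x)
grad = lambda x: -r / (r @ x)
rho = np.linalg.norm(r) / r.min()      # bound on ||grad f||_2 over the simplex
theta = 2.0                            # bound on sup ||x - y||_2^2 over the simplex
beta = 0.5 * min(1.0, 1.0 / (4.0 * rho * np.sqrt(theta)))
for _ in range(1000):
    x, y = rng.dirichlet(np.ones(d)), rng.dirichlet(np.ones(d))
    lhs = f(y)
    rhs = f(x) + grad(x) @ (y - x) + 0.5 * beta * (grad(x) @ (x - y))**2
    assert lhs >= rhs - 1e-9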

Finally, let us describe the Online Newton Step (ONS) algorithm, which was first presented in [37]. We will show that, if the functions played by the enemy are guaranteed to be differentiable and exp-concave on the set from which the player picks her choices, then the ONS algorithm's worst-case regret bound is logarithmic w.r.t. the number of rounds of the game. A player oracle which formally implements the ONS algorithm is defined in Algorithm 6.5.
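To complement the pseudocode of Algorithm 6.5, the following is a minimal NumPy/SciPy sketch of the ONS update. It is only an illustration: the feasible set is taken to be the probability simplex, the generalized projection Π^{G_t}_X is computed numerically with a generic solver, and all names (proj_G, ons, grad_fns) are hypothetical rather than part of the text.

import numpy as np
from scipy.optimize import minimize

def proj_G(z, G):
    """Generalized projection onto the probability simplex in the norm induced by G,
    computed numerically (argmin over x in the simplex of (x - z)^T G (x - z))."""
    d = len(z)
    obj = lambda x: 0.5 * (x - z) @ G @ (x - z)
    jac = lambda x: G @ (x - z)
    cons = ({"type": "eq", "fun": lambda x: np.sum(x) - 1.0},)
    res = minimize(obj, np.full(d, 1.0 / d), jac=jac, method="SLSQP",
                   bounds=[(0.0, 1.0)] * d, constraints=cons)
    return res.x

def ons(grad_fns, d, eta, eps):
    """Run Algorithm 6.5 against a fixed sequence of gradient oracles grad_fns[t]."""
    G = eps * np.eye(d)                       # G_0 = eps * I
    x = np.full(d, 1.0 / d)                   # x_1 = argmin_{x in X} ||x||_2^2 on the simplex
    iterates = [x]
    for grad_f in grad_fns:
        g = grad_f(x)                         # nabla f_t(x_t)
        G = G + np.outer(g, g)                # G_t = G_{t-1} + g g^T
        y = x - eta * np.linalg.solve(G, g)   # x_t - eta * G_t^{-1} nabla f_t(x_t)
        x = proj_G(y, G)                      # Pi^{G_t}_X(...)
        iterates.append(x)
    return iterates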

One may have noticed some similarities between Algorithm 6.5 and the AdaGrad algorithm we presented earlier in this chapter. The algorithm still maintains, at each round t ∈ N \ {0}, a matrix constructed from rank-one updates based on the gradients of previous functions. On the other hand, the way these matrices are used in the iterate updates is slightly different.

As before, let us see how to write ONS as an instance of the AdaReg algorithm. Again, in order to do so, we will pick a convex function on R^d and transform it into a function on symmetric matrices by applying the function only to the eigenvalues (see Section 3.7 for details). Surprisingly, the meta-regularizer we will use is a multiple of X ∈ S^d_{++} ↦ −ln det X, a barrier function deeply connected with interior-point methods [58]. It is very interesting to see this connection, since the ONS algorithm did not seem to be based on any of the main concepts from interior-point methods when it was first proposed in [37].

Algorithm 6.5 Definition of ONS_X⟨f₁, …, f_T⟩
Input:
  (i) A closed and convex set ∅ ≠ X ⊆ R^d,
  (ii) Convex functions f₁, …, f_T ∈ F for some T ∈ N and F ⊆ (−∞,+∞]^{R^d} such that f_t is differentiable on X for each t ∈ [T],
  (iii) Real numbers η, ε > 0 (usually clear from the context).
Output: x_{T+1} ∈ X
  Define G₀ ← εI
  Let {x₁} ← arg min_{x∈X} ‖x‖₂²
  for t = 1 to T do
    ▷ Computations for round t + 1
    Define G_t ← G_{t−1} + ∇f_t(x_t)∇f_t(x_t)^T
    x_{t+1} ← Π^{G_t}_X(x_t − ηG_t^{−1}∇f_t(x_t))
  return x_{T+1}

Lemma 6.5.6. Let η > 0, define f: R^d → (−∞,+∞] by

    f(x) := −η Σ_{i=1}^{d} [x_i > 0] ln x_i + δ(x | R^d_+),   ∀x ∈ R^d,

and set Φ := f^S. Then,

    Φ(H) = −η ln det(H)   and   ∇Φ(H) = −ηH^{−1},   ∀H ∈ S^d_{++},      (6.22)

and for every G ∈ S^d_{++} the infimum inf_{H∈S^d_{++}} (⟨G, H⟩ + Φ(H)) is attained by ηG^{−1}. Moreover, the function Φ is a meta-regularizer and we have ONS_X = AdaReg_X^Φ for every nonempty closed convex set X ⊆ R^d and for every ε > 0, where the value of η in ONS is the same as in the definition of f.

Proof. Let H ∈ S^d_{++} and set λ := λ(H). First, let us show that (6.22) holds for H. By the definition of f^S we have

    f^S(H) = f(λ) = −η Σ_{i=1}^{d} ln λ_i = −η ln(Π_{i=1}^{d} λ_i) = −η ln det(H),

where in the last equation we used Corollary 1.1.2. Let us now check that Φ is differentiable at H. Define Λ := Diag(λ). By the Spectral Decomposition Theorem (Theorem 1.1.1), there is an orthogonal matrix Q ∈ R^{d×d} such that H = QΛQ^T. Since f is differentiable on R^d_{++}, by Corollary 3.7.5 we have that Φ is differentiable on S^d_{++} and that

    ∇Φ(H) = Q Diag(∇f(λ))Q^T = −ηQΛ^{−1}Q^T = −ηH^{−1}.

This ends the proof of (6.22). Let G ∈ S^d_{++}. Let us now show that

    {ηG^{−1}} = arg min_{H∈S^d_{++}} (⟨H, G⟩ + Φ(H)).      (6.23)

Let Ĥ ∈ S^d_{++}. By Theorem 3.6.2, Ĥ attains the above infimum if and only if

    0 = G + ∇Φ(Ĥ) = G − ηĤ^{−1}  ⟺  Ĥ = ηG^{−1}.

This proves (6.23).

Let us now show that

    Φ is a meta-regularizer.      (6.24)

Let T ∈ N and g ∈ (R^d)^T. Moreover, let ε > 0 and set G_{T−1} := εI + Σ_{t=1}^{T−1} g_t g_t^T and G_T := G_{T−1} + g_T g_T^T. Condition (6.3.i) is satisfied by Φ since, by (6.23), we know that inf_{H∈S^d_{++}} (⟨H, G_T⟩ + Φ(H)) is attained by ηG_T^{−1}. Thus, set H_{T+1} := ηG_T^{−1} and H_T := ηG_{T−1}^{−1}. Note that

    H_{T+1}^{−1} − H_T^{−1} = (1/η)(G_T − G_{T−1}) = (1/η) g_T g_T^T ⪰ 0,

that is, Φ satisfies condition (6.3.ii), which finishes the proof of (6.24).

Last but not least, let us show that ONS_X = AdaReg_X^Φ for any ε > 0 and any closed and convex set ∅ ≠ X ⊆ R^d (recall that η > 0 is already given by the statement). Let f := ⟨f₁, …, f_T⟩ ∈ Seq((−∞,+∞]^{R^d}) be such that f_t is closed, convex, and subdifferentiable on X for every t ∈ [T]. Let us show that ONS_X(f_{1:t−1}) = AdaReg_X^Φ(f_{1:t−1}) by induction on t ∈ [T + 1]. Set x₁ := AdaReg_X^Φ(⟨⟩) and let H₁ be as in the definition of AdaReg_X^Φ(⟨⟩). By (6.23), we know that H₁ = (η/ε)I. Thus,

    x₁ ∈ arg min_{x∈X} ‖x‖²_{H₁^{−1}} = arg min_{x∈X} x^T H₁^{−1} x = arg min_{x∈X} (ε/η) x^T x = arg min_{x∈X} ‖x‖₂².

Since the squared ℓ₂-norm is strictly convex (see Lemma 3.9.5), x₁ is the unique minimizer of the above minima and, thus, x₁ = ONS_X(⟨⟩). Let t ∈ {2, …, T+1} and define x_i := AdaReg_X^Φ(f_{1:i−1}) = ONS_X(f_{1:i−1}) for every i ∈ {1, …, t−1} (where the equation holds by induction). Moreover, for every i ∈ {1, …, t−1} let g_i ∈ R^d and G_i ∈ S^d_{++} be as in the definition of x_t := AdaReg_X^Φ(f_{1:t−1}). For every i ∈ {1, …, t−1}, since f_i is differentiable on X, by Theorem 3.5.5 we conclude that g_i = ∇f_i(x_i). Thus, G_{t−1} is the same as the one in the definition of ONS_X(f_{1:t−1}). Finally, let H_t be as in the definition of AdaReg_X^Φ(f_{1:t−1}). Then,

    x_t = Π^{H_t^{−1}}_X (x_{t−1} − H_t g_{t−1}) = Π^{G_{t−1}}_X (x_{t−1} − ηG_{t−1}^{−1} g_{t−1}) = ONS_X(f_{1:t−1}),

where the second equality follows from (6.23).
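As an aside, the minimization identity (6.23) at the heart of this proof is easy to spot-check numerically; the sketch below (with arbitrary illustrative values of η and G) compares the objective at ηG^{−1} with its value at nearby positive definite perturbations.

import numpy as np

# Numerical spot-check of (6.23): H = eta * G^{-1} minimizes <G, H> - eta * ln det(H)
# over positive definite H.  The matrix G and the constant eta are arbitrary choices.
rng = np.random.default_rng(1)
d, eta = 4, 0.7
A = rng.standard_normal((d, d))
G = A @ A.T + np.eye(d)                                    # a matrix in S^d_{++}
phi = lambda H: np.trace(G @ H) - eta * np.linalg.slogdet(H)[1]
H_star = eta * np.linalg.inv(G)
for _ in range(200):
    B = rng.standard_normal((d, d))
    H = H_star + 0.01 * (B + B.T)                          # small symmetric perturbation
    if np.all(np.linalg.eigvalsh(H) > 0):                  # keep H in S^d_{++}
        assert phi(H) >= phi(H_star) - 1e-10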

Finally, let us show that ONS attains logarithmic regret (w.r.t. the number of rounds) when playing against an enemy who plays only differentiable and exp-concave functions. Before proving the regret bound itself in Theorem 6.5.8, we need to prove a simple lemma to bound the eigenvalues of the matrices G_t ∈ S^d_{++} which the ONS oracle builds through its iterations.

Lemma 6.5.7. Let T ∈ N, and let ρ ∈ R_+ and g₁, …, g_T ∈ R^d be such that ‖g_t‖₂ ≤ ρ for every t ∈ [T]. Moreover, let ε > 0 and set G := εI + Σ_{t=1}^{T} g_t g_t^T. Then, for every i ∈ [d],

    λ_i(G) ≤ ρ²T + ε   and   det(G) ≤ (ρ²T + ε)^d.

Proof. Let i ∈ [d] and let v ∈ R^d be an eigenvector of G associated with λ_i(G). Then,

    λ_i(G)v = Gv = εv + Σ_{t=1}^{T} g_t g_t^T v.

Therefore,

    λ_i(G)‖v‖₂² = ε‖v‖₂² + Σ_{t=1}^{T} (g_t^T v)² ≤ ε‖v‖₂² + Σ_{t=1}^{T} ‖g_t‖₂²‖v‖₂² ≤ ε‖v‖₂² + Tρ²‖v‖₂².

Dividing the above inequality by ‖v‖₂² (which is nonzero since v is an eigenvector) yields the first bound from the statement. The bound on the determinant follows directly from Corollary 1.1.2, which shows that det(G) = Π_{i=1}^{d} λ_i(G).
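For concreteness, the bounds of Lemma 6.5.7 can be checked numerically on random data (all constants below are arbitrary illustrative choices):

import numpy as np

# Numerical check of Lemma 6.5.7 on random vectors with ||g_t||_2 <= rho.
rng = np.random.default_rng(2)
d, T, rho, eps = 5, 200, 2.0, 0.1
g = rng.standard_normal((T, d))
g *= np.minimum(1.0, rho / np.linalg.norm(g, axis=1))[:, None]   # enforce ||g_t||_2 <= rho
G = eps * np.eye(d) + g.T @ g                                    # eps*I + sum_t g_t g_t^T
bound = rho**2 * T + eps
assert np.max(np.linalg.eigvalsh(G)) <= bound + 1e-9             # lambda_i(G) <= rho^2 T + eps
assert np.linalg.slogdet(G)[1] <= d * np.log(bound) + 1e-9       # det(G) <= (rho^2 T + eps)^d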

Theorem 6.5.8. Let C := (X, F) be an OCO instance such that X ⊆ R^d is a nonempty closed set and such that each f ∈ F is a proper closed convex function. Moreover, let α, ρ ∈ R_{++} and suppose that there is a convex set D ⊇ X with nonempty interior such that every f ∈ F is differentiable on int(D), α-exp-concave on D, and ρ-Lipschitz continuous on D. Suppose there is θ ∈ R_{++} such that θ ≥ sup{‖x − u‖₂² : x, u ∈ X}. Define

    β := (1/2) min{α, 1/(4ρ√θ)},   η := 1/β,   and   ε := η²/θ.

Finally, let T ∈ N, let ENEMY be an enemy oracle for C, and define

    (x, f) := OCO_C(ONS_X, ENEMY, T).

Then,

    Regret(ONS_X, f, X) ≤ 4d(1/α + ρ√θ)(1 + d^{−1} + ln T).

Proof. Define h: R^d → (−∞,+∞] by

    h(x) := −η Σ_{i=1}^{d} [x_i > 0] ln x_i + δ(x | R^d_+),   ∀x ∈ R^d,

and set Φ := h^S. By Lemma 6.5.6, we have Φ(H) = −η ln det(H) for every H ∈ S^d_{++} and AdaReg_X^Φ = ONS_X. Thus, we only need to bound the regret of AdaReg_X^Φ. By Theorem 3.5.5, for every t ∈ [T] we have g_t = ∇f_t(x_t), where g_t ∈ ∂f_t(x_t) is as in the definition of AdaReg_X^Φ(f). For each t ∈ [T] define f̃_t: R^d → (−∞,+∞] by

    f̃_t(x) := ∇f_t(x_t)^T x,   ∀x ∈ R^d.

Let u ∈ X. Since f_t is α-exp-concave for each t ∈ [T], by Theorem 6.5.5 we have

    Regret(AdaReg_X^Φ, f, u) = Σ_{t=1}^{T} (f_t(x_t) − f_t(u))
        ≤ Σ_{t=1}^{T} ∇f_t(x_t)^T(x_t − u) − (β/2) Σ_{t=1}^{T} (∇f_t(x_t)^T(x_t − u))²
        = Σ_{t=1}^{T} (f̃_t(x_t) − f̃_t(u)) − (β/2) Σ_{t=1}^{T} (∇f_t(x_t)^T(x_t − u))²
        = Regret(AdaReg_X^Φ, f̃, u) − (β/2) Σ_{t=1}^{T} (∇f_t(x_t)^T(x_t − u))²,

where in the last equality we have used the fact that, for every t ∈ [T], we have AdaReg_X^Φ(f_{1:t−1}) = AdaReg_X^Φ(f̃_{1:t−1}) since ∇f̃_t(x_t) = ∇f_t(x_t). For each t ∈ {1, …, T+1}, let H_t, G_{t−1} ∈ S^d_{++} be as in the definition of AdaReg_X^Φ(f̃), and define D_t := H_t^{−1} − [t > 1]H_{t−1}^{−1}. Thus, by Theorem 6.2.4 with x₀ := x₁ we have

    Regret(AdaReg_X^Φ, f̃, u) ≤ (1/2) Σ_{t=0}^{T} ‖u − x_t‖²_{D_{t+1}} + (1/2) min_{H∈S^d_{++}} (⟨G_T, H⟩ + Φ(H) − Φ(H₁)).

Therefore,

    Regret(AdaReg_X^Φ, f, u) ≤ (1/2) Σ_{t=0}^{T} ‖u − x_t‖²_{D_{t+1}} − (β/2) Σ_{t=1}^{T} (∇f_t(x_t)^T(x_t − u))² + (1/2) min_{H∈S^d_{++}} (⟨G_T, H⟩ + Φ(H) − Φ(H₁)).      (6.25)

Let us bound each of the above terms separately. By Lemma 6.5.6, we have

    H_t = ηG_{t−1}^{−1} = (1/β)G_{t−1}^{−1},   ∀t ∈ {1, …, T+1}.      (6.26)

Hence, D₁ = βεI (since G₀ = εI and, thus, H₁ = (βε)^{−1}I) and D_{t+1} = β(G_t − G_{t−1}) = β∇f_t(x_t)∇f_t(x_t)^T for each t ∈ [T]. Thus,

    Σ_{t=0}^{T} ‖x_t − u‖²_{D_{t+1}} = βε‖x₀ − u‖₂² + β Σ_{t=1}^{T} (∇f_t(x_t)^T(x_t − u))² ≤ βεθ + β Σ_{t=1}^{T} (∇f_t(x_t)^T(x_t − u))²

    ⟹ Σ_{t=0}^{T} ‖x_t − u‖²_{D_{t+1}} − β Σ_{t=1}^{T} (∇f_t(x_t)^T(x_t − u))² ≤ βεθ.

Moreover,

    min_{H∈S^d_{++}} (⟨G_T, H⟩ + Φ(H) − Φ(H₁))
      = ⟨G_T, H_{T+1}⟩ + Φ(H_{T+1}) − Φ(H₁)
      = (1/β)Tr(I) + Φ(β^{−1}G_T^{−1}) − Φ((βε)^{−1}I)            by (6.26),
      = (1/β)(d − ln det(β^{−1}G_T^{−1}) + ln det((βε)^{−1}I))
      = (1/β)(d + ln det(βG_T) − ln(βε)^d)                         since det(A^{−1}) = det(A)^{−1},
      = (1/β)(d + ln[det(βG_T)/(βε)^d])
      ≤ (1/β)(d + ln[β^d(ρ²T + ε)^d/(β^d ε^d)])                    by Lemma 6.5.7,
      = (d/β)(1 + ln(ρ²T/ε + 1)).

By the definition of β, we have

    1/β ≤ 2(1/α + 4ρ√θ) ≤ 8(1/α + ρ√θ)   ⟹   1/(2β) ≤ 4(1/α + ρ√θ).

Plugging these inequalities into (6.25) and using the definitions of ε and η yields

    Regret(ONS_X, f, u) ≤ (1/2)(βεθ + (d/β)(1 + ln(Tρ²/ε + 1)))
        = (1/2)(1/β + (d/β)(1 + ln(Tβ²ρ²θ + 1)))
        ≤ (1/(2β))(1 + d(1 + ln(T/64 + 1)))
        ≤ (1/(2β))(1 + d(1 + ln T))
        = (1/(2β))(1 + d + d ln T)
        ≤ 4(1/α + ρ√θ)(1 + d + d ln T)
        = 4d(1/α + ρ√θ)(d^{−1} + 1 + ln T).

Since u ∈ X was arbitrary, this proves the claimed bound on Regret(ONS_X, f, X).
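As a purely illustrative experiment, the ons() sketch given after Algorithm 6.5 can be run on the sequential-investment losses f_t(x) = −ln(r_t^T x) of Proposition 6.5.3 (all 1-exp-concave). The returns r_t, the comparator, and the constants below are assumed, hypothetical choices that merely mirror the parameter definitions of Theorem 6.5.8; the sketch also relies on the earlier definitions of ons() and proj_G().

import numpy as np

# Hypothetical usage of the ons() sketch on log losses f_t(x) = -ln(r_t^T x).
rng = np.random.default_rng(4)
d, T = 3, 500
returns = rng.uniform(0.5, 1.5, size=(T, d))                 # r_t in R^d_{++}
grad_fns = [lambda x, r=r: -r / (r @ x) for r in returns]    # nabla f_t

# Parameters mirroring Theorem 6.5.8 with alpha = 1 (rho and theta are valid
# upper bounds for the gradient norm and squared diameter on the simplex).
alpha = 1.0
rho = max(np.linalg.norm(r) / r.min() for r in returns)
theta = 2.0
beta = 0.5 * min(alpha, 1.0 / (4.0 * rho * np.sqrt(theta)))
eta, eps = 1.0 / beta, (1.0 / beta) ** 2 / theta

xs = ons(grad_fns, d, eta, eps)                              # x_1, ..., x_{T+1}
player_loss = sum(-np.log(r @ x) for r, x in zip(returns, xs[:T]))
best_single_asset = min(np.sum(-np.log(returns[:, i])) for i in range(d))
print("regret vs. best single asset:", player_loss - best_single_asset)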

In contrast to the case of the AdaGrad algorithm, the player needs a good deal of prior information about the problem, such as the Lipschitz and exp-concavity constants, in order to use the right parameters so that the above regret bound holds. In spite of this, the above regret bound is still impressive, since it is an exponential improvement (w.r.t. the number of rounds) compared to the regret of AdaGrad.