
The Online Newton Step Algorithm

By Lemma 3.1.1, h is concave if and only if ∇²h(x) ⪯ 0 for every x ∈ dom h = X. The latter holds if and only if, for every x ∈ X,

    α²e^{−αf(x)}∇f(x)∇f(x)^T ⪯ αe^{−αf(x)}∇²f(x)  ⟺  α∇f(x)∇f(x)^T ⪯ ∇²f(x).

Let us now look at some examples of exp-concave functions for the sake of concreteness. First, let us show that the functions from the sequential investment problem defined in Section 2.2.4 are all exp-concave.

Proposition 6.5.3. Let r ∈ R^d_{++}, define the convex set X := {x ∈ R^d_+ : 1^T x > 0}, and define the function f(x) := −ln(r^T x) + δ(x | X) for every x ∈ R^d. Then f is 1-exp-concave on X.

Proof. Since x ∈ R^d ↦ r^T x is twice continuously differentiable on R^d, since α ∈ R_{++} ↦ ln(α) is twice continuously differentiable on R_{++}, and since r^T x > 0 for every x ∈ X, we conclude that f is twice continuously differentiable on X. Moreover, note that

    ∇f(x) = −(1/(r^T x)) r   and   ∇²f(x) = (1/(r^T x)²) rr^T   for every x ∈ X.

Therefore, for every x ∈ X,

    ∇²f(x) − ∇f(x)∇f(x)^T = (1/(r^T x)²) rr^T − (1/(r^T x)²) rr^T = 0.

Thus, by Lemma 6.5.2 we conclude that f is 1-exp-concave on X.
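To make the criterion concrete, here is a small numerical spot-check (a sketch, not part of the original development) of the condition α∇f(x)∇f(x)^T ⪯ ∇²f(x) with α = 1 for this f; the vectors r and x below are arbitrary illustrative choices.

import numpy as np

# Spot-check of the 1-exp-concavity criterion for f(x) = -ln(r^T x).
# The values of r and x are arbitrary; any r > 0 and x >= 0 with r^T x > 0 work.
rng = np.random.default_rng(0)
d = 4
r = rng.uniform(0.5, 2.0, size=d)          # r in R^d_{++}
x = rng.uniform(0.1, 1.0, size=d)          # a point of X with r^T x > 0
rx = r @ x
grad = -r / rx                             # gradient of f at x
hess = np.outer(r, r) / rx**2              # Hessian of f at x
# Criterion from Lemma 6.5.2: Hessian - grad grad^T must be positive semidefinite;
# here it is exactly the zero matrix, as computed in the proof above.
assert np.allclose(hess - np.outer(grad, grad), 0.0)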

The next theorem, which we will use later as a tool to prove regret bounds, shows an inequality for exp-concave functions which is similar to the inequality for strongly convex functions given by Theorem 3.9.7. Namely, the latter theorem states that if a closed convex function f: R^d → (−∞,+∞] is α-strongly convex and f is subdifferentiable at x ∈ X, then

    f(y) ≥ f(x) + g^T(y − x) + (α/2)‖x − y‖₂²,   ∀y ∈ X, ∀g ∈ ∂f(x).

The inequality we prove in the next theorem for exp-concave functions is similar to the one above.

The main difference is that instead of the squared norm, the inequality from the next theorem uses a "local norm"15 based on the gradient of the function. Before jumping into the theorem, we need a simple result which we prove next.

Lemma 6.5.4. For every α ∈ [−1/4, 1/4], we have

    −ln(1 − α) ≥ α + (1/4)α².

Proof. For every α < 1 define

    f(α) := −ln(1 − α)   and   h(α) := α + (1/4)α².

Since f(0) = 0 = h(0) and since both f and h are differentiable on [−1/4, 1/4], to prove f(α) ≥ h(α) for every α ∈ [−1/4, 1/4] it suffices to prove f′(α) ≥ h′(α) for every α ∈ [0, 1/4] and f′(α) ≤ h′(α) for every α ∈ [−1/4, 0). Note that, for every α ∈ [−1/4, 1/4], since 1 − α > 0 we have

    f′(α) ≥ h′(α)  ⟺  1/(1 − α) ≥ 1 + (1/2)α  ⟺  2 ≥ (2 + α)(1 − α)  ⟺  2 ≥ 2 − α − α²  ⟺  α² + α ≥ 0.

Since α² + α ≥ 0 for every α ∈ [0, 1/4] and α² + α ≤ 0 for every α ∈ [−1/4, 0), we are done.

15. Note that it is not a norm since the matrix in the inequality is a rank-one matrix.
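As a quick sanity check of Lemma 6.5.4 (not part of the proof), the inequality can be verified numerically on a grid of the interval [−1/4, 1/4]:

import numpy as np

# Numerical sanity check of Lemma 6.5.4: -ln(1 - a) >= a + a^2/4 on [-1/4, 1/4].
a = np.linspace(-0.25, 0.25, 1001)
lhs = -np.log1p(-a)          # log1p(-a) = ln(1 - a)
rhs = a + 0.25 * a**2
assert np.all(lhs >= rhs)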

Theorem 6.5.5. Let X ⊆ R^d, let ‖·‖ be a norm on R^d, and let α ∈ R_{++}. Let f: R^d → (−∞,+∞] be a closed convex function which is α-exp-concave on X, ρ-Lipschitz continuous w.r.t. ‖·‖ on X, and differentiable on X. Moreover, suppose there is θ ∈ R_{++} such that sup_{x,y∈X} ‖x − y‖² ≤ θ. Finally, let β ∈ R_{++} be such that β ≤ (1/2) min{(4ρ√θ)^{−1}, α}. Then, for any x, y ∈ X we have

    f(y) ≥ f(x) + ∇f(x)^T(y − x) + (β/2)(x − y)^T∇f(x)∇f(x)^T(x − y).

Proof. Since 2β ≤ α, we have that f is 2β-exp-concave on X. Thus, h := e^{−2βf(·)} + δ(· | X) is a concave function. Let x, y ∈ X. Then, by the subgradient inequality,

    h(y) ≤ h(x) + ∇h(x)^T(y − x)
      ⟹ e^{−2βf(y)} ≤ e^{−2βf(x)} − 2βe^{−2βf(x)}∇f(x)^T(y − x)
      ⟹ e^{−2βf(y)} ≤ e^{−2βf(x)}(1 − 2β∇f(x)^T(y − x))
      ⟹ −2βf(y) ≤ −2βf(x) + ln(1 − 2β∇f(x)^T(y − x))
      ⟹ f(y) ≥ f(x) − (1/(2β)) ln(1 − 2β∇f(x)^T(y − x)).      (6.21)

By Theorems 3.5.5 and 3.8.4, we have ‖∇f(x)‖ ≤ ρ. Hence,

    |2β∇f(x)^T(y − x)| ≤ 2β‖∇f(x)‖‖x − y‖ ≤ 2βρ√θ ≤ 1/4.

Thus, we can use Lemma 6.5.4 on (6.21), which yields

    f(y) ≥ f(x) − (1/(2β)) ln(1 − 2β∇f(x)^T(y − x))
         ≥ f(x) + (1/(2β))(2β∇f(x)^T(y − x) + (1/4)(2β∇f(x)^T(y − x))²)
         = f(x) + ∇f(x)^T(y − x) + (β/2)(∇f(x)^T(y − x))²,

which is the desired inequality since (∇f(x)^T(y − x))² = (x − y)^T∇f(x)∇f(x)^T(x − y).
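For concreteness, the inequality of Theorem 6.5.5 can be spot-checked numerically for the 1-exp-concave loss f(x) = −ln(r^T x) on the probability simplex. The sketch below is only illustrative; the vector r, the Lipschitz and diameter bounds rho and theta, and the sample size are arbitrary assumed values.

import numpy as np

# Numerical spot-check of Theorem 6.5.5 for f(x) = -ln(r^T x) on the simplex.
rng = np.random.default_rng(3)
d = 3
r = rng.uniform(0.5, 1.5, size=d)
f = lambda x: -np.log(r @ x)
grad = lambda x: -r / (r @ x)
rho = np.linalg.norm(r) / r.min()      # bound on ||grad f||_2 over the simplex
theta = 2.0                            # bound on sup ||x - y||_2^2 over the simplex
beta = 0.5 * min(1.0, 1.0 / (4.0 * rho * np.sqrt(theta)))
for _ in range(1000):
    x, y = rng.dirichlet(np.ones(d)), rng.dirichlet(np.ones(d))
    lhs = f(y)
    rhs = f(x) + grad(x) @ (y - x) + 0.5 * beta * (grad(x) @ (x - y))**2
    assert lhs >= rhs - 1e-9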

Finally, let us describe the Online Newton Step (ONS) algorithm, which was first presented in [37]. We will show that, if the functions played by the enemy are guaranteed to be differentiable and exp-concave on the set from which the player picks her choices, then the ONS algorithm's worst-case regret bound is logarithmic w.r.t. the number of rounds of the game. A player oracle which formally implements the ONS algorithm is defined in Algorithm 6.5.
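To complement the pseudocode of Algorithm 6.5, the following is a minimal NumPy/SciPy sketch of the ONS update. It is only an illustration: the feasible set is taken to be the probability simplex, the generalized projection Π^{G_t}_X is computed numerically with a generic solver, and all names (proj_G, ons, grad_fns) are hypothetical rather than part of the text.

import numpy as np
from scipy.optimize import minimize

def proj_G(z, G):
    """Generalized projection onto the probability simplex in the norm induced by G,
    computed numerically (argmin over x in the simplex of (x - z)^T G (x - z))."""
    d = len(z)
    obj = lambda x: 0.5 * (x - z) @ G @ (x - z)
    jac = lambda x: G @ (x - z)
    cons = ({"type": "eq", "fun": lambda x: np.sum(x) - 1.0},)
    res = minimize(obj, np.full(d, 1.0 / d), jac=jac, method="SLSQP",
                   bounds=[(0.0, 1.0)] * d, constraints=cons)
    return res.x

def ons(grad_fns, d, eta, eps):
    """Run Algorithm 6.5 against a fixed sequence of gradient oracles grad_fns[t]."""
    G = eps * np.eye(d)                       # G_0 = eps * I
    x = np.full(d, 1.0 / d)                   # x_1 = argmin_{x in X} ||x||_2^2 on the simplex
    iterates = [x]
    for grad_f in grad_fns:
        g = grad_f(x)                         # nabla f_t(x_t)
        G = G + np.outer(g, g)                # G_t = G_{t-1} + g g^T
        y = x - eta * np.linalg.solve(G, g)   # x_t - eta * G_t^{-1} nabla f_t(x_t)
        x = proj_G(y, G)                      # Pi^{G_t}_X(...)
        iterates.append(x)
    return iterates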

One may have noticed some similarities between Algorithm 6.5 and the AdaGrad algorithm we presented earlier in this chapter. The algorithm still maintains, at each round t ∈ N \ {0}, a matrix constructed from rank-one updates based on the gradients of previous functions. On the other hand, the way these matrices are used in the iterate updates is slightly different.

As before, let us see how to write ONS as an instance of the AdaReg algorithm. Again, in order to do so, we will pick a convex function on R^d and transform it into a function on symmetric matrices by applying the function only to the eigenvalues (see Section 3.7 for details). Surprisingly, the meta-regularizer we will use is a multiple of X ∈ S^d_{++} ↦ −ln det X, a barrier function deeply connected with interior-point methods [58]. It is very interesting to see this connection, since the ONS algorithm did not seem to be based on any of the main concepts from interior-point methods when it was first proposed in [37].

Algorithm 6.5 Definition of ONS_X⟨f₁, …, f_T⟩
Input:
  (i) A closed and convex set ∅ ≠ X ⊆ R^d,
  (ii) Convex functions f₁, …, f_T ∈ F for some T ∈ N and F ⊆ (−∞,+∞]^{R^d} such that f_t is differentiable on X for each t ∈ [T],
  (iii) Real numbers η, ε > 0 (usually clear from the context).
Output: x_{T+1} ∈ X
  Define G₀ ← εI
  Let {x₁} ← arg min_{x∈X} ‖x‖₂²
  for t = 1 to T do
    ▷ Computations for round t + 1
    Define G_t ← G_{t−1} + ∇f_t(x_t)∇f_t(x_t)^T
    x_{t+1} ← Π^{G_t}_X(x_t − ηG_t^{−1}∇f_t(x_t))
  return x_{T+1}

Lemma 6.5.6. Let η > 0, define f: R^d → (−∞,+∞] by

    f(x) := −η Σ_{i=1}^{d} [x_i > 0] ln x_i + δ(x | R^d_+),   ∀x ∈ R^d,

and set Φ := f^S. Then,

    Φ(H) = −η ln det(H)   and   ∇Φ(H) = −ηH^{−1},   ∀H ∈ S^d_{++},      (6.22)

and for every G ∈ S^d_{++} the infimum inf_{H∈S^d_{++}} (⟨G, H⟩ + Φ(H)) is attained by ηG^{−1}. Moreover, the function Φ is a meta-regularizer and we have ONS_X = AdaReg_X^Φ for every nonempty closed convex set X ⊆ R^d and for every ε > 0, where the value of η in ONS is the same as in the definition of f.

Proof. Let H ∈ S^d_{++} and set λ := λ(H). First, let us show that (6.22) holds for H. By the definition of f^S we have

    f^S(H) = f(λ) = −η Σ_{i=1}^{d} ln λ_i = −η ln(Π_{i=1}^{d} λ_i) = −η ln det(H),

where in the last equation we used Corollary 1.1.2. Let us now check that Φ is differentiable at H. Define Λ := Diag(λ). By the Spectral Decomposition Theorem (Theorem 1.1.1), there is an orthogonal matrix Q ∈ R^{d×d} such that H = QΛQ^T. Since f is differentiable on R^d_{++}, by Corollary 3.7.5 we have that Φ is differentiable on S^d_{++} and that

    ∇Φ(H) = Q Diag(∇f(λ))Q^T = −ηQΛ^{−1}Q^T = −ηH^{−1}.

This ends the proof of (6.22). Let G ∈ S^d_{++}. Let us now show that

    {ηG^{−1}} = arg min_{H∈S^d_{++}} (⟨H, G⟩ + Φ(H)).      (6.23)

Let Ĥ ∈ S^d_{++}. By Theorem 3.6.2, Ĥ attains the above infimum if and only if

    0 = G + ∇Φ(Ĥ) = G − ηĤ^{−1}  ⟺  Ĥ = ηG^{−1}.

This proves (6.23).

Let us now show that

    Φ is a meta-regularizer.      (6.24)

Let T ∈ N and g ∈ (R^d)^T. Moreover, let ε > 0 and set G_{T−1} := εI + Σ_{t=1}^{T−1} g_t g_t^T and G_T := G_{T−1} + g_T g_T^T. Condition (6.3.i) is satisfied by Φ since, by (6.23), we know that inf_{H∈S^d_{++}} (⟨H, G_T⟩ + Φ(H)) is attained by ηG_T^{−1}. Thus, set H_{T+1} := ηG_T^{−1} and H_T := ηG_{T−1}^{−1}. Note that

    H_{T+1}^{−1} − H_T^{−1} = (1/η)(G_T − G_{T−1}) = (1/η) g_T g_T^T ⪰ 0,

that is, Φ satisfies condition (6.3.ii), which finishes the proof of (6.24).

Last but not least, let us show that ONS_X = AdaReg_X^Φ for any ε > 0 and any closed and convex set ∅ ≠ X ⊆ R^d (recall that η > 0 is already given by the statement). Let f := ⟨f₁, …, f_T⟩ ∈ Seq((−∞,+∞]^{R^d}) be such that f_t is closed, convex, and subdifferentiable on X for every t ∈ [T]. Let us show that ONS_X(f_{1:t−1}) = AdaReg_X^Φ(f_{1:t−1}) by induction on t ∈ [T + 1]. Set x₁ := AdaReg_X^Φ(⟨⟩) and let H₁ be as in the definition of AdaReg_X^Φ(⟨⟩). By (6.23), we know that H₁ = (η/ε)I. Thus,

    x₁ ∈ arg min_{x∈X} ‖x‖²_{H₁^{−1}} = arg min_{x∈X} x^T H₁^{−1} x = arg min_{x∈X} (ε/η) x^T x = arg min_{x∈X} ‖x‖₂².

Since the squared ℓ₂-norm is strictly convex (see Lemma 3.9.5), x₁ is the unique minimizer of the above minima and, thus, x₁ = ONS_X(⟨⟩). Let t ∈ {2, …, T+1} and define x_i := AdaReg_X^Φ(f_{1:i−1}) = ONS_X(f_{1:i−1}) for every i ∈ {1, …, t−1} (where the equation holds by induction). Moreover, for every i ∈ {1, …, t−1} let g_i ∈ R^d and G_i ∈ S^d_{++} be as in the definition of x_t := AdaReg_X^Φ(f_{1:t−1}). For every i ∈ {1, …, t−1}, since f_i is differentiable on X, by Theorem 3.5.5 we conclude that g_i = ∇f_i(x_i). Thus, G_{t−1} is the same as the one in the definition of ONS_X(f_{1:t−1}). Finally, let H_t be as in the definition of AdaReg_X^Φ(f_{1:t−1}). Then,

    x_t = Π^{H_t^{−1}}_X (x_{t−1} − H_t g_{t−1}) = Π^{G_{t−1}}_X (x_{t−1} − ηG_{t−1}^{−1} g_{t−1}) = ONS_X(f_{1:t−1}),

where the second equality follows from (6.23).
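As an aside, the minimization identity (6.23) at the heart of this proof is easy to spot-check numerically; the sketch below (with arbitrary illustrative values of η and G) compares the objective at ηG^{−1} with its value at nearby positive definite perturbations.

import numpy as np

# Numerical spot-check of (6.23): H = eta * G^{-1} minimizes <G, H> - eta * ln det(H)
# over positive definite H.  The matrix G and the constant eta are arbitrary choices.
rng = np.random.default_rng(1)
d, eta = 4, 0.7
A = rng.standard_normal((d, d))
G = A @ A.T + np.eye(d)                                    # a matrix in S^d_{++}
phi = lambda H: np.trace(G @ H) - eta * np.linalg.slogdet(H)[1]
H_star = eta * np.linalg.inv(G)
for _ in range(200):
    B = rng.standard_normal((d, d))
    H = H_star + 0.01 * (B + B.T)                          # small symmetric perturbation
    if np.all(np.linalg.eigvalsh(H) > 0):                  # keep H in S^d_{++}
        assert phi(H) >= phi(H_star) - 1e-10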

Finally, let us show that ONS attains logarithmic regret (w.r.t. the number of rounds) when playing against an enemy who plays only differentiable and exp-concave functions. Before proving the regret bound itself in Theorem 6.5.8, we need to prove a simple lemma to bound the eigenvalues of the matrices G_t ∈ S^d_{++} which the ONS oracle builds through its iterations.

Lemma 6.5.7. Let T ∈ N, and let ρ ∈ R_+ and g₁, …, g_T ∈ R^d be such that ‖g_t‖₂ ≤ ρ for every t ∈ [T]. Moreover, let ε > 0 and set G := εI + Σ_{t=1}^{T} g_t g_t^T. Then, for every i ∈ [d],

    λ_i(G) ≤ ρ²T + ε   and   det(G) ≤ (ρ²T + ε)^d.

Proof. Let i ∈ [d] and let v ∈ R^d be an eigenvector of G associated with λ_i(G). Then,

    λ_i(G)v = Gv = εv + Σ_{t=1}^{T} g_t g_t^T v.

Therefore,

    λ_i(G)‖v‖₂² = ε‖v‖₂² + Σ_{t=1}^{T} (g_t^T v)² ≤ ε‖v‖₂² + Σ_{t=1}^{T} ‖g_t‖₂²‖v‖₂² ≤ ε‖v‖₂² + Tρ²‖v‖₂².

Dividing the above inequality by ‖v‖₂² (which is nonzero since v is an eigenvector) yields the first bound from the statement. The bound on the determinant follows directly from Corollary 1.1.2, which shows that det(G) = Π_{i=1}^{d} λ_i(G).
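For concreteness, the bounds of Lemma 6.5.7 can be checked numerically on random data (all constants below are arbitrary illustrative choices):

import numpy as np

# Numerical check of Lemma 6.5.7 on random vectors with ||g_t||_2 <= rho.
rng = np.random.default_rng(2)
d, T, rho, eps = 5, 200, 2.0, 0.1
g = rng.standard_normal((T, d))
g *= np.minimum(1.0, rho / np.linalg.norm(g, axis=1))[:, None]   # enforce ||g_t||_2 <= rho
G = eps * np.eye(d) + g.T @ g                                    # eps*I + sum_t g_t g_t^T
bound = rho**2 * T + eps
assert np.max(np.linalg.eigvalsh(G)) <= bound + 1e-9             # lambda_i(G) <= rho^2 T + eps
assert np.linalg.slogdet(G)[1] <= d * np.log(bound) + 1e-9       # det(G) <= (rho^2 T + eps)^d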

Theorem 6.5.8. Let C := (X, F) be an OCO instance such that X ⊆ R^d is a nonempty closed set and such that each f ∈ F is a proper closed convex function. Moreover, let α, ρ ∈ R_{++} and suppose that there is a convex set D ⊇ X with nonempty interior such that every f ∈ F is differentiable on int(D), α-exp-concave on D, and ρ-Lipschitz continuous on D. Suppose there is θ ∈ R_{++} such that θ ≥ sup{‖x − u‖₂² : x, u ∈ X}. Define

    β := (1/2) min{α, 1/(4ρ√θ)},   η := 1/β,   and   ε := η²/θ.

Finally, let T ∈ N, let ENEMY be an enemy oracle for C, and define

    (x, f) := OCO_C(ONS_X, ENEMY, T).

Then,

    Regret(ONS_X, f, X) ≤ 4d(1/α + ρ√θ)(1 + d^{−1} + ln T).

Proof. Define h: R^d → (−∞,+∞] by

    h(x) := −η Σ_{i=1}^{d} [x_i > 0] ln x_i + δ(x | R^d_+),   ∀x ∈ R^d,

and set Φ := h^S. By Lemma 6.5.6, we have Φ(H) = −η ln det(H) for every H ∈ S^d_{++} and AdaReg_X^Φ = ONS_X. Thus, we only need to bound the regret of AdaReg_X^Φ. By Theorem 3.5.5, for every t ∈ [T] we have g_t = ∇f_t(x_t), where g_t ∈ ∂f_t(x_t) is as in the definition of AdaReg_X^Φ(f). For each t ∈ [T] define f̃_t: R^d → (−∞,+∞] by

    f̃_t(x) := ∇f_t(x_t)^T x,   ∀x ∈ R^d.

Let u ∈ X. Since f_t is α-exp-concave for each t ∈ [T], by Theorem 6.5.5 we have

    Regret(AdaReg_X^Φ, f, u) = Σ_{t=1}^{T} (f_t(x_t) − f_t(u))
        ≤ Σ_{t=1}^{T} ∇f_t(x_t)^T(x_t − u) − (β/2) Σ_{t=1}^{T} (∇f_t(x_t)^T(x_t − u))²
        = Σ_{t=1}^{T} (f̃_t(x_t) − f̃_t(u)) − (β/2) Σ_{t=1}^{T} (∇f_t(x_t)^T(x_t − u))²
        = Regret(AdaReg_X^Φ, f̃, u) − (β/2) Σ_{t=1}^{T} (∇f_t(x_t)^T(x_t − u))²,

where in the last equality we have used the fact that, for every t ∈ [T], we have AdaReg_X^Φ(f_{1:t−1}) = AdaReg_X^Φ(f̃_{1:t−1}) since ∇f̃_t(x_t) = ∇f_t(x_t). For each t ∈ {1, …, T+1}, let H_t, G_{t−1} ∈ S^d_{++} be as in the definition of AdaReg_X^Φ(f̃), and define D_t := H_t^{−1} − [t > 1]H_{t−1}^{−1}. Thus, by Theorem 6.2.4 with x₀ := x₁ we have

    Regret(AdaReg_X^Φ, f̃, u) ≤ (1/2) Σ_{t=0}^{T} ‖u − x_t‖²_{D_{t+1}} + (1/2) min_{H∈S^d_{++}} (⟨G_T, H⟩ + Φ(H) − Φ(H₁)).

Therefore,

    Regret(AdaReg_X^Φ, f, u) ≤ (1/2) Σ_{t=0}^{T} ‖u − x_t‖²_{D_{t+1}} − (β/2) Σ_{t=1}^{T} (∇f_t(x_t)^T(x_t − u))² + (1/2) min_{H∈S^d_{++}} (⟨G_T, H⟩ + Φ(H) − Φ(H₁)).      (6.25)

Let us bound each of the above terms separately. By Lemma 6.5.6, we have

    H_t = ηG_{t−1}^{−1} = (1/β)G_{t−1}^{−1},   ∀t ∈ {1, …, T+1}.      (6.26)

Hence, D₁ = βεI (since G₀ = εI and, thus, H₁ = (βε)^{−1}I) and D_{t+1} = β(G_t − G_{t−1}) = β∇f_t(x_t)∇f_t(x_t)^T for each t ∈ [T]. Thus,

    Σ_{t=0}^{T} ‖x_t − u‖²_{D_{t+1}} = βε‖x₀ − u‖₂² + β Σ_{t=1}^{T} (∇f_t(x_t)^T(x_t − u))² ≤ βεθ + β Σ_{t=1}^{T} (∇f_t(x_t)^T(x_t − u))²

    ⟹ Σ_{t=0}^{T} ‖x_t − u‖²_{D_{t+1}} − β Σ_{t=1}^{T} (∇f_t(x_t)^T(x_t − u))² ≤ βεθ.

Moreover,

    min_{H∈S^d_{++}} (⟨G_T, H⟩ + Φ(H) − Φ(H₁))
      = ⟨G_T, H_{T+1}⟩ + Φ(H_{T+1}) − Φ(H₁)
      = (1/β)Tr(I) + Φ(β^{−1}G_T^{−1}) − Φ((βε)^{−1}I)            by (6.26),
      = (1/β)(d − ln det(β^{−1}G_T^{−1}) + ln det((βε)^{−1}I))
      = (1/β)(d + ln det(βG_T) − ln(βε)^d)                         since det(A^{−1}) = det(A)^{−1},
      = (1/β)(d + ln[det(βG_T)/(βε)^d])
      ≤ (1/β)(d + ln[β^d(ρ²T + ε)^d/(β^d ε^d)])                    by Lemma 6.5.7,
      = (d/β)(1 + ln(ρ²T/ε + 1)).

By the definition of β, we have

    1/β ≤ 2(1/α + 4ρ√θ) ≤ 8(1/α + ρ√θ)   ⟹   1/(2β) ≤ 4(1/α + ρ√θ).

Plugging these inequalities into (6.25) and using the definitions of ε and η yields

    Regret(ONS_X, f, u) ≤ (1/2)(βεθ + (d/β)(1 + ln(Tρ²/ε + 1)))
        = (1/2)(1/β + (d/β)(1 + ln(Tβ²ρ²θ + 1)))
        ≤ (1/(2β))(1 + d(1 + ln(T/64 + 1)))
        ≤ (1/(2β))(1 + d(1 + ln T))
        = (1/(2β))(1 + d + d ln T)
        ≤ 4(1/α + ρ√θ)(1 + d + d ln T)
        = 4d(1/α + ρ√θ)(d^{−1} + 1 + ln T).

Since u ∈ X was arbitrary, this proves the claimed bound on Regret(ONS_X, f, X).
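As a purely illustrative experiment, the ons() sketch given after Algorithm 6.5 can be run on the sequential-investment losses f_t(x) = −ln(r_t^T x) of Proposition 6.5.3 (all 1-exp-concave). The returns r_t, the comparator, and the constants below are assumed, hypothetical choices that merely mirror the parameter definitions of Theorem 6.5.8; the sketch also relies on the earlier definitions of ons() and proj_G().

import numpy as np

# Hypothetical usage of the ons() sketch on log losses f_t(x) = -ln(r_t^T x).
rng = np.random.default_rng(4)
d, T = 3, 500
returns = rng.uniform(0.5, 1.5, size=(T, d))                 # r_t in R^d_{++}
grad_fns = [lambda x, r=r: -r / (r @ x) for r in returns]    # nabla f_t

# Parameters mirroring Theorem 6.5.8 with alpha = 1 (rho and theta are valid
# upper bounds for the gradient norm and squared diameter on the simplex).
alpha = 1.0
rho = max(np.linalg.norm(r) / r.min() for r in returns)
theta = 2.0
beta = 0.5 * min(alpha, 1.0 / (4.0 * rho * np.sqrt(theta)))
eta, eps = 1.0 / beta, (1.0 / beta) ** 2 / theta

xs = ons(grad_fns, d, eta, eps)                              # x_1, ..., x_{T+1}
player_loss = sum(-np.log(r @ x) for r, x in zip(returns, xs[:T]))
best_single_asset = min(np.sum(-np.log(returns[:, i])) for i in range(d))
print("regret vs. best single asset:", player_loss - best_single_asset)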

In contrast to the case of the AdaGrad algorithm, the player needs a good deal of prior information about the problem, such as the Lipschitz and exp-concavity constants, in order to use the right parameters so that the above regret bound holds. In spite of this, the above regret bound is still impressive, since it is an exponential improvement (w.r.t. the number of rounds) compared to the regret of AdaGrad.