

Corollary 5.4.4 (Derived from Theorem 5.4.3). Let C := (X, F) be an OCO instance such that X is closed and such that each f ∈ F is a proper closed function which is subdifferentiable on X. Let R: E → (−∞,+∞] be a mirror map for X, let ENEMY be an enemy oracle for C, and let T ∈ N. Moreover, define

(x, f) := OCO_C(EOMD^R_X, ENEMY, T).

Finally, let g_t ∈ ∂f_t(x_t) be as in the definition of EOMD^R_X(f) in Algorithm 5.2 for each t ∈ [T]. If σ ∈ R_{++} is such that R is σ-strongly convex w.r.t. a norm ‖·‖ on E, then

Regret(EOMD^R_X, f, u) ≤ B_R(u, x_1) + (1/(2σ)) ∑_{t=1}^{T} ‖g_t‖_*²,   ∀u ∈ X,   (5.16)

where x_0 := x_1. In particular, if every function in F is ρ-Lipschitz continuous w.r.t. ‖·‖ on a convex set D ⊆ E such that X ⊆ int(D), if there is θ ∈ R_{++} such that θ ≥ sup{B_R(x, y) : x ∈ X, y ∈ X ∩ int(dom R)}, and if R′ := (ρ√T/√(2σθ)) R is also a mirror map for X, then

Regret_T(EOMD^{R′}_X, ENEMY, X) ≤ ρ√(2θT/σ).

Proof. Note that EOMD^R_X = AdaOMD^ℛ_X, where ℛ is given by ℛ(f) := [f = ⟨⟩]R for every f ∈ Seq((−∞,+∞]^E). Moreover, since R is a mirror map for X, ℛ is a mirror map strategy for C. Therefore, the first inequality is a direct application of Theorem 5.4.3 together with the fact that R is σ-strongly convex on X w.r.t. ‖·‖.

If each f ∈ F is ρ-Lipschitz continuous w.r.t. ‖·‖ on a convex set D such that X ⊆ int(D), then by Theorem 3.8.4 we have that ∂f(x) ⊆ {g ∈ E : ‖g‖_* ≤ ρ} for each f ∈ F and x ∈ X. Using this in (5.16) yields

Regret_T(EOMD^R_X, ENEMY, u) ≤ B_R(u, x_1) + Tρ²/(2σ).

Moreover, suppose there is θ ∈ R_{++} such that θ ≥ sup{B_R(x, y) : x ∈ X, y ∈ X ∩ int(dom R)}, and define

R′ := (ρ√T/√(2σθ)) R.

Note that R′ is (ρ√(σT)/√(2θ))-strongly convex on X. Finally, suppose R′ is also a mirror map.

Then, plugging R′ into the above inequality yields, for every u ∈ X,

Regret_T(EOMD^{R′}_X, ENEMY, u) ≤ (ρ√T/√(2σθ)) B_R(u, x_1) + ρ√(θT)/√(2σ) ≤ ρ√(θT)/√(2σ) + ρ√(θT)/√(2σ) = ρ√(2θT/σ),

where in the second inequality we took the supremum over u ∈ X.
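For concreteness, here is a quick informal instantiation of the tuned bound (the Euclidean facts below are standard but are not part of the corollary itself): take X to be the Euclidean ball of radius r and R := (1/2)‖·‖_2², so that σ = 1 and B_R(x, y) = (1/2)‖x − y‖_2² ≤ 2r² for all x, y ∈ X. Choosing θ := 2r² gives

\[
\mathrm{Regret}_T\bigl(\mathrm{EOMD}^{R'}_X, \mathrm{ENEMY}, X\bigr) \;\le\; \rho\sqrt{\frac{2\theta T}{\sigma}} \;=\; \rho\sqrt{4 r^2 T} \;=\; 2\rho r\sqrt{T}.
\]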

Dual Averaging or Lazy Online Mirror Descent

Recall from the last section that, at each round, the Adaptive OMD oracle takes a subgradient step in the dual space and then maps the resulting point y_{t−1} (as defined in Algorithm 5.1) back into the primal through the Bregman projection to get x_{t−1}. Thus, one may wonder what happens if we are lazy and make the subgradient step directly from y_t instead of computing the gradient of R_{t+1} at x_t and only then making a subgradient step from ∇R_{t+1}(x_t). Avoiding the computation of the gradient of the mirror map at every round (even though the algorithm still has to project the iterate from the dual to the primal space) may yield a drastic improvement in the time needed to compute each round in a practical implementation. This is exactly the idea of the Adaptive Dual Averaging (Adaptive DA or AdaDA) or Adaptive Lazy Online Mirror Descent algorithm. In Algorithm 5.3 we define the AdaDA oracle, which implements this algorithm. Moreover, in Figure 5.2 we present a schematic view of the computations done by AdaDA on round t+1 (one may find it useful to compare this figure with Figure 5.1). The name Dual Averaging comes originally from the static version of this algorithm for classic convex optimization [56] (though it is not originally presented in the same way as in Algorithm 5.3).
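To make the eager/lazy contrast concrete, here is a minimal Python sketch of a single round of each update for one particular choice of our own (not prescribed by the text): the Euclidean mirror map R(x) = ‖x‖²/(2η) over a Euclidean ball, with eta, radius and proj_ball as illustrative names.

```python
import numpy as np

# A minimal sketch (our own illustrative choice): one round of the "eager"
# AdaOMD-style update versus the "lazy" dual-averaging update, for the Euclidean
# mirror map R(x) = ||x||^2 / (2 * eta) over a Euclidean ball X.  For this R,
# grad R(x) = x / eta, grad R^*(y) = eta * y, and the Bregman projection onto the
# ball coincides with the Euclidean projection.

def proj_ball(x, radius=1.0):
    """Euclidean projection onto the ball of the given radius."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

def eager_omd_step(x_t, g_t, eta=0.1, radius=1.0):
    # Eager: evaluate grad R at the current iterate, step in the dual, map back, project.
    y = x_t / eta - g_t
    return proj_ball(eta * y, radius)

def lazy_da_step(y_t, g_t, eta=0.1, radius=1.0):
    # Lazy: update a running dual point directly; grad R is never evaluated at x_t.
    y_next = y_t - g_t
    x_next = proj_ball(eta * y_next, radius)
    return x_next, y_next
```

The point of the sketch is only the shape of the computations: the eager step needs ∇R at x_t every round, while the lazy step merely maintains a running dual point and projects it back to the primal.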

Algorithm 5.3 Definition of AdaDA^ℛ_X(⟨f_1, …, f_T⟩)
Input:
(i) A closed convex set X ⊆ E;
(ii) Convex functions f_1, …, f_T ∈ F for some T ∈ N and a set of convex functions F ⊆ (−∞,+∞]^E such that f_t is subdifferentiable on X for each t ∈ [T];
(iii) A mirror map strategy ℛ: Seq(F) → (−∞,+∞]^E for the OCO instance (X, F) which is differentiable on the open convex set D ⊆ E.
Output: x_{T+1} ∈ D ∩ X
  r_1 ← ℛ(⟨⟩)
  {x_1} ← argmin_{x ∈ X} r_1(x)
  y_1 ← 0
  for t = 1 to T do
    ▷ Computations for round t+1
    Define r_{t+1} := ℛ(⟨f_1, …, f_t⟩) and R_{t+1} := ∑_{i=1}^{t+1} r_i
    y_{t+1} := y_t − g_t, where g_t ∈ ∂f_t(x_t)
    x_{t+1} := Π^{R_{t+1}}_X(∇R_{t+1}^*(y_{t+1}))
  return x_{T+1}
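As a hedged illustration of the loop above, the following Python sketch instantiates Algorithm 5.3 with one concrete choice of mirror map strategy that is ours, not the text's: increments r_t = c_t·(1/2)‖·‖² over the unit Euclidean ball, for which ∇R_{t+1}^*(y) = y / (c_1 + … + c_{t+1}) and the Bregman projection is the Euclidean one.

```python
import numpy as np

# Hedged sketch of the AdaDA loop of Algorithm 5.3 for E = R^n, X = unit Euclidean
# ball, and mirror map increments r_t = c_t * (1/2)||.||^2, so that
# R_t = (c_1 + ... + c_t) * (1/2)||.||^2 and grad R_t^*(y) = y / (c_1 + ... + c_t).

def proj_ball(x, radius=1.0):
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

def ada_da(subgradient, T, n, increment):
    """subgradient(t, x) returns some g_t in the subdifferential of f_t at x;
    increment(t) returns the coefficient c_t of the t-th mirror map increment."""
    coeff = increment(1)             # R_1 = c_1 * (1/2)||.||^2
    x = np.zeros(n)                  # {x_1} = argmin of r_1 over the ball
    y = np.zeros(n)                  # y_1 = 0
    iterates = [x]
    for t in range(1, T + 1):
        g = subgradient(t, x)
        y = y - g                    # lazy subgradient step in the dual space
        coeff += increment(t + 1)    # R_{t+1} = (c_1 + ... + c_{t+1}) * (1/2)||.||^2
        x = proj_ball(y / coeff)     # x_{t+1} = Pi_X(grad R_{t+1}^*(y_{t+1}))
        iterates.append(x)
    return iterates

# Example usage with linear losses f_t(x) = <g_t, x> and AdaGrad-like increments:
rng = np.random.default_rng(0)
gs = rng.normal(size=(50, 3))
xs = ada_da(lambda t, x: gs[t - 1], T=50, n=3,
            increment=lambda t: np.sqrt(t) - np.sqrt(t - 1))
```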

As for the AdaOMD oracle, we pass the set X ⊆ E from which the player is supposed to pick her points as a parameter of AdaDA. One may note, however, that this is not necessary due to Proposition 5.4.1, which says that using a mirror map strategy plus the indicator function of X as a new mirror map renders the Bregman projection unnecessary. Still, we found that presenting the AdaDA oracle in the way most similar to the AdaOMD oracle is informative.
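As a concrete (and purely illustrative) instance of that remark, take X to be the probability simplex and the negative-entropy mirror map; folding δ(· | X) into the mirror map turns the dual-to-primal map into a softmax, whose output already lies in X, so no separate Bregman projection is needed:

```python
import numpy as np

# Purely illustrative check: with X the probability simplex and R the negative
# entropy, the conjugate of R + delta(.|X) is the log-sum-exp, whose gradient is the
# softmax.  The dual-to-primal map therefore already lands in X, so no separate
# Bregman projection is required.

def softmax(y):
    z = np.exp(y - np.max(y))        # gradient of the conjugate of negentropy + indicator
    return z / z.sum()

y = np.array([0.3, -1.2, 2.0])       # an arbitrary dual point (e.g., minus summed subgradients)
x = softmax(y)
assert np.all(x >= 0) and abs(x.sum() - 1.0) < 1e-12   # already feasible: x lies in the simplex
```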

Finally, one may recall from the last section that making, at each round, a subgradient step from the gradient of the mirror map computed at the previous iterate was one of the sources of complications in writing the Adaptive OMD as an application of the Adaptive FTRL oracle. Thus, we may hope that the AdaDA oracle has a much cleaner connection with AdaFTRL than AdaOMD does. The following theorem shows that this is indeed the case.

[Figure 5.2: Graphic representation of the computations done by AdaDA on round t+1 (primal and dual spaces).]

Theorem 5.5.1. Let C := (X, F) be an OCO instance such that X is closed and such that each f ∈ F is proper and closed. Let ℛ: Seq(F) → (−∞,+∞]^E be a mirror map strategy for C, let ENEMY be an enemy oracle for C, and let T ∈ N. Define

(x, f) := OCO_C(AdaDA^ℛ_X, ENEMY, T),   x_{T+1} := AdaDA^ℛ_X(f),   and   R_t := ∑_{i=1}^{t} ℛ(⟨f_1, …, f_{i−1}⟩) for each t ∈ {1, …, T+1}.

Moreover, let g_t ∈ ∂f_t(x_t) be the same as in the definition of AdaDA^ℛ_X(f) in Algorithm 5.3 for each t ∈ [T]. If R_t is strongly convex⁹ on X for each t ∈ {1, …, T+1}, then

{x_t} = argmin_{x ∈ X} ( ∑_{i=1}^{t−1} ⟨g_i, x⟩ + R_t(x) )   (5.17)

for each t ∈ {1, …, T+1}. Additionally, set F_g := {f ∈ F : g ∈ ∂f(x) for some x ∈ X} for each g ∈ E, and L := {⟨g, ·⟩ : g ∈ E s.t. F_g ≠ ∅}. Moreover, for every h ∈ L and for g_h := ∇h(0) (that is, h = ⟨g_h, ·⟩), set

f_h := { f_t if g_h = g_t for some t ∈ [T];  some f ∈ F_{g_h} otherwise },   ∀h ∈ L.

Finally, define

ℛ′(h) := ℛ(⟨f_{h_1}, f_{h_2}, …, f_{h_t}⟩) + δ(· | X)   ∀h ∈ L^t, ∀t ∈ N.

In this case, ℛ′ is a FTRL regularizer strategy for C′ := (X, L) and

AdaDA^ℛ_X(f_{1:t−1}) = AdaFTRL^{ℛ′}(⟨⟨g_1, ·⟩, …, ⟨g_{t−1}, ·⟩⟩)   ∀t ∈ {1, …, T+1}.

⁹We need this assumption in order to apply Proposition 5.4.1.

Proof. Let t ∈ [T] and let y_t ∈ E be as in the definition of AdaDA^ℛ_X(f) in Algorithm 5.3. Since y_1 = 0, by an easy induction one can see that y_t = −∑_{i=1}^{t−1} g_i. By the definition of AdaDA in Algorithm 5.3 we have that x_t = Π^{R_t}_X(∇R_t^*(y_t)). Since R_t is strongly convex on X, by Lemma 3.11.4 we have x_t = ∇P_t^*(y_t), where P_t := R_t + δ(· | X). Since R_t and δ(· | X) are closed (recall that X is closed), by Theorem 3.2.7 we know that P_t is closed. Therefore, by the properties of subgradients from Theorem 3.5.2 (namely items (ii) and (v)), and since {∇P_t^*(y_t)} = ∂P_t^*(y_t) by Theorem 3.5.5, we have

x_t = ∇P_t^*(y_t)  ⟺  {x_t} = argmax_{x ∈ E} (⟨y_t, x⟩ − P_t(x)) = argmin_{x ∈ X} ( ∑_{i=1}^{t−1} ⟨g_i, x⟩ + R_t(x) ).

In particular, define ℛ′ as in the statement of the theorem. Let us first show that

ℛ′ is a FTRL regularizer strategy for C′.   (5.18)

Let T₀ ∈ N, let h ∈ L^{T₀}, and set R₀ := ∑_{t=1}^{T₀+1} ℛ′(h_{1:t−1}). Since ℛ is a mirror map strategy, since X is closed, and since the sum of closed and convex functions is also closed and convex by Theorem 3.2.7, we clearly have that ℛ′(h) is a closed proper convex function, that is, ℛ′ satisfies condition (4.5.i) of a FTRL regularizer strategy for C′. Thus, we only need to show that R₀ is a classical FTRL regularizer for C′. With the same arguments we have just used, it is easy to see that R₀ is a proper closed convex function, that is, it satisfies property (4.4.i) of a FTRL regularizer for C′. Moreover, we clearly have dom R₀ ⊆ X, which is condition (4.4.ii) of a FTRL regularizer for C′. Let T″ ∈ N and h″ ∈ L^{T″}. Note that by assumption each mirror map increment of ℛ is strongly convex on X. Thus, R₀ is strongly convex on E, which implies that R₀ + ∑_{t=1}^{T″} h″_t is strongly convex on E and closed by Theorem 3.2.7, since it is the sum of closed and convex functions. Therefore, by Lemma 3.9.14 we know that inf_{x ∈ E} (R₀ + ∑_{t=1}^{T″} h″_t)(x) is attained, which finishes the proof of (5.18). Finally, note that for every t ∈ {1, …, T+1} we have

{x_t} = argmin_{x ∈ X} ( ∑_{i=1}^{t−1} ⟨g_i, x⟩ + R_t(x) )

      = argmin_{x ∈ E} ( ∑_{i=1}^{t−1} ⟨g_i, x⟩ + ( ∑_{i=1}^{t} ℛ(⟨f_1, …, f_{i−1}⟩) )(x) + δ(x | X) )

      = argmin_{x ∈ E} ( ∑_{i=1}^{t−1} ⟨g_i, x⟩ + ∑_{i=1}^{t} ℛ′(⟨⟨g_1, ·⟩, …, ⟨g_{i−1}, ·⟩⟩)(x) )

      = AdaFTRL^{ℛ′}(⟨⟨g_1, ·⟩, …, ⟨g_{t−1}, ·⟩⟩).
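A hedged numerical sanity check of this correspondence, for one concrete setup of our choosing (unit Euclidean ball and the static mirror map R = (1/2)‖·‖², so that R_t = R for all t): the AdaDA iterate is the projection of y_t = −(g_1 + … + g_{t−1}) onto the ball, and (5.17) says it should minimize the FTRL objective on the linearized losses over X.

```python
import numpy as np

# Hedged check: for X = unit Euclidean ball and static R = (1/2)||.||^2, the AdaDA
# iterate Pi_X(grad R^*(y_t)) = Pi_X(-(g_1 + ... + g_{t-1})) should minimize the
# FTRL objective <g_1 + ... + g_{t-1}, x> + R(x) over X.  Subgradients are placeholders.

def proj_ball(x, radius=1.0):
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

def ftrl_objective(x, G):
    return G @ x + 0.5 * (x @ x)

rng = np.random.default_rng(1)
gs = rng.normal(size=(20, 4))

for t in range(1, 21):
    G = gs[: t - 1].sum(axis=0)      # g_1 + ... + g_{t-1} (the zero vector when t = 1)
    x_da = proj_ball(-G)             # AdaDA: Bregman-project grad R^*(y_t) = y_t = -G
    # Compare against random feasible points: x_da should attain the smallest objective.
    for _ in range(200):
        z = proj_ball(rng.normal(size=4)) * rng.uniform()
        assert ftrl_objective(x_da, G) <= ftrl_objective(z, G) + 1e-9
```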

The above theorem tells us something very interesting: Adaptive Dual Averaging with mirror map strategy ℛ is closely related (actually, almost equivalent) to the Adaptive FTRL algorithm whose regularizer strategy is ℛ with δ(· | X) added to each mirror map increment, applied to the linearized versions of the functions played by the enemy. The name Dual Averaging stems exactly from the equality between AdaDA and AdaFTRL given by the above theorem. Indeed, in the application of AdaFTRL in the above theorem we are minimizing over the set X the linear function given by the average¹⁰ of the subgradients of the past functions plus a regularizer function.
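To make the averaging in the name explicit: for t ≥ 2, dividing the objective in (5.17) by the positive constant t − 1 does not change its minimizer, so the same iterate can be written as

\[
\{x_t\} \;=\; \operatorname*{arg\,min}_{x \in X}\;\Bigl(\Bigl\langle \tfrac{1}{t-1}\sum_{i=1}^{t-1} g_i,\; x\Bigr\rangle \;+\; \tfrac{1}{t-1}\, R_t(x)\Bigr),
\]

that is, the minimizer over X of the average of the past subgradients (as a linear function) plus a rescaled regularizer.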

This simplification done by the Adaptive Dual Averaging algorithm when compared to the Adaptive Online Mirror Descent does not come without its costs. Note that, by the last theorem, AdaDA works like a general FTRL algorithm, while AdaOMD works as a proximal FTRL algorithm. As discussed in Section 4.7, this may influence the efficiency or the amount of previous information needed by the oracle in some cases. We will look more carefully at some of these cases in Chapter 6.

¹⁰Even though we are looking at the sum of the subgradients in the formula, recall that we can scale the regularizer to effectively normalize this sum.

Given such a clean connection between the Adaptive Dual Averaging algorithm and the Adaptive FTRL algorithm, it is no surprise that regret bounds for the AdaFTRL oracle directly imply regret bounds for AdaDA, as we show in the next corollary.

Corollary 5.5.2 (Derived from Theorems 4.4.3 and 5.5.1). Let C := (X, F) be an OCO instance such that X is closed and such that each f ∈ F is a proper closed function which is subdifferentiable on X. Let ℛ: Seq(F) → (−∞,+∞]^E be a mirror map strategy for C, let ENEMY be an enemy oracle for C, and let T ∈ N. Moreover, define

(x, f) := OCO_C(AdaDA^ℛ_X, ENEMY, T),
r_t := ℛ(⟨f_1, …, f_{t−1}⟩) for each t ∈ [T], and
R_t := ∑_{i=1}^{t} r_i for each t ∈ [T].

Finally, let g_t ∈ ∂f_t(x_t) be as in the definition of AdaDA^ℛ_X(⟨f_1, …, f_T⟩) in Algorithm 5.3 for each t ∈ [T], and suppose that for each t ∈ [T] there are σ_t ∈ R_{++} and a norm ‖·‖_{(t)} on E such that R_t is σ_t-strongly convex w.r.t. ‖·‖_{(t)} on E. Then, for every u ∈ X,

Regret(AdaDA^ℛ_X, f, u) ≤ ∑_{t=1}^{T} (r_t(u) − r_t(x_t)) + (1/2) ∑_{t=1}^{T} (1/σ_t) ‖g_t‖_{(t),∗}².

Proof. Define h_t := ⟨g_t, ·⟩ for each t ∈ [T], set L := {⟨g, ·⟩ : f ∈ F, x ∈ X, g ∈ ∂f(x)}, and define the OCO instance C′ := (X, L). By Theorem 5.5.1, we know that there is a FTRL regularizer strategy ℛ′ for C′ such that x_t = AdaFTRL^{ℛ′}(⟨h_1, …, h_{t−1}⟩) for every t ∈ [T]. Therefore, by the subgradient inequality, for every u ∈ X we have

Regret(AdaDA^ℛ_X, f, u) = ∑_{t=1}^{T} (f_t(x_t) − f_t(u)) ≤ ∑_{t=1}^{T} ⟨g_t, x_t − u⟩ = Regret(AdaFTRL^{ℛ′}, h, u).   (5.19)

Moreover, by the definition of ℛ′ (see Theorem 5.5.1), we have

∑_{i=1}^{t} ℛ′(h_{1:i−1}) = ∑_{i=1}^{t} ℛ(f_{1:i−1}) + δ(· | X) = R_t + δ(· | X),   ∀t ∈ [T].

Since R_t is σ_t-strongly convex w.r.t. the norm ‖·‖_{(t)} on E for every t ∈ [T], we have that ℛ′ is σ-strong¹¹ for h w.r.t. ‖·‖_{(1)}, …, ‖·‖_{(T)}, where σ := ⟨σ_1, …, σ_T⟩. Finally, since ∇h_t(x_t) = g_t for each t ∈ [T], by the general AdaFTRL regret bound from Theorem 4.4.3 we have, for every u ∈ X,

Regret(AdaFTRL^{ℛ′}, h, u) ≤ ∑_{t=1}^{T} (r_t(u) − r_t(x_t)) + (1/2) ∑_{t=1}^{T} (1/σ_t) ‖g_t‖_{(t),∗}².

¹¹Note that the condition on the relative interior of the regularizer and the functions in L is trivially satisfied since all functions in L are finite everywhere.
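As a hedged end-to-end sketch of this bound (all concrete choices below are ours, not the text's): run AdaDA on linear losses over the unit Euclidean ball with increments r_t = (√t − √(t−1))·(1/2)‖·‖², so that R_t = √t·(1/2)‖·‖² is σ_t = √t-strongly convex w.r.t. the self-dual norm ‖·‖_2, and evaluate both sides of the inequality above.

```python
import numpy as np

# Hedged sketch: AdaDA on linear losses f_t(x) = <g_t, x> over the unit Euclidean
# ball with increments r_t = (sqrt(t) - sqrt(t-1)) * (1/2)||.||^2.  We compute the
# realized regret against a comparator u and the right-hand side of the bound;
# the corollary guarantees regret <= rhs for this instance.

def proj_ball(x, radius=1.0):
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

def c(t):
    # increment coefficient c_t, chosen so that c_1 + ... + c_t = sqrt(t)
    return np.sqrt(t) - np.sqrt(t - 1)

rng = np.random.default_rng(2)
T, dim = 200, 3
gs = rng.normal(size=(T, dim))                  # placeholder subgradients g_1, ..., g_T

xs, y = [np.zeros(dim)], np.zeros(dim)          # x_1 = argmin of r_1 over the ball, y_1 = 0
for t in range(1, T + 1):
    y = y - gs[t - 1]                           # lazy dual step: y_{t+1} = y_t - g_t
    xs.append(proj_ball(y / np.sqrt(t + 1)))    # x_{t+1} = Pi_X(grad R_{t+1}^*(y_{t+1}))

u = -gs.sum(axis=0)
u = u / np.linalg.norm(u)                       # some comparator point u in X
regret = sum(gs[t - 1] @ (xs[t - 1] - u) for t in range(1, T + 1))

reg_term = sum(c(t) * 0.5 * (u @ u - xs[t - 1] @ xs[t - 1]) for t in range(1, T + 1))
stab_term = 0.5 * sum((gs[t - 1] @ gs[t - 1]) / np.sqrt(t) for t in range(1, T + 1))
print(regret, "<=", reg_term + stab_term)       # inequality guaranteed by Corollary 5.5.2
```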

In a way similar to what we have done for the AdaFTRL and AdaOMD algorithms, let us look at a version of AdaDA with a static regularizer, which we call (classical) Lazy Online Mirror Descent. We define an oracle which implements this algorithm in Algorithm 5.4.

Algorithm 5.4 Definition of LOMD^R_X(⟨f_1, …, f_T⟩)
Input:
(i) A closed convex set X ⊆ E;
(ii) Convex functions f_1, …, f_T ∈ F for some T ∈ N and F ⊆ (−∞,+∞]^E such that f_i is subdifferentiable on X for each i ∈ [T];
(iii) A mirror map R: E → (−∞,+∞] for (X, F).
Output: x_{T+1} ∈ int(dom R) ∩ X
  {x_1} ← argmin_{x ∈ X} R(x)
  y_1 ← 0
  for t = 1 to T do
    ▷ Computations for round t+1
    Compute g_t ∈ ∂f_t(x_t)
    y_{t+1} ← y_t − g_t
    x_{t+1} ← Π^R_X(∇R^*(y_{t+1}))
  return x_{T+1}
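For one concrete static mirror map, R(x) = ‖x‖²/(2η) (our illustrative choice), ∇R^*(y) = ηy and the Bregman projection is the Euclidean one, so Algorithm 5.4 reduces to "lazy projected subgradient descent". A minimal Python sketch, with proj_X, eta and the loss subgradients as placeholders:

```python
import numpy as np

# Minimal sketch of Algorithm 5.4 for the static mirror map R(x) = ||x||^2 / (2 * eta):
# grad R^*(y) = eta * y and the Bregman projection coincides with the Euclidean one.

def lomd(subgradient, proj_X, T, dim, eta=0.1):
    x = proj_X(np.zeros(dim))        # {x_1} = argmin of R over X (minimum-norm point of X)
    y = np.zeros(dim)                # y_1 = 0
    iterates = [x]
    for t in range(1, T + 1):
        g = subgradient(t, x)        # some g_t in the subdifferential of f_t at x_t
        y = y - g                    # lazy dual step, never touching grad R(x_t)
        x = proj_X(eta * y)          # x_{t+1} = Pi_X(grad R^*(y_{t+1}))
        iterates.append(x)
    return iterates
```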

Corollary 5.5.3 (Derived from Corollary 5.5.2). Let C := (X, F) be an OCO instance such that X is closed and such that each f ∈ F is a proper closed function which is subdifferentiable on X. Let R: E → (−∞,+∞] be a mirror map for X, let ENEMY be an enemy oracle for C, and let T ∈ N. Moreover, define

(x, f) := OCO_C(LOMD^R_X, ENEMY, T).

Finally, let g_t ∈ ∂f_t(x_t) be as in the definition of LOMD^R_X(f) in Algorithm 5.4 for each t ∈ [T], and suppose there are σ ∈ R_{++} and a norm ‖·‖ on E such that R is σ-strongly convex w.r.t. ‖·‖ on X. Then,

Regret(LOMD^R_X, f, u) ≤ R(u) − R(x_1) + (1/(2σ)) ∑_{t=1}^{T} ‖g_t‖_*²,   ∀u ∈ X.   (5.20)

In particular, if every function in F is ρ-Lipschitz continuous w.r.t. ‖·‖ on a convex set D ⊆ E such that X ⊆ int D, if there is θ ∈ R_{++} such that θ ≥ sup{R(x) − R(y) : x ∈ X, y ∈ X ∩ dom R}, and if R′ := (ρ√T/√(2σθ)) R is also a mirror map for X, then

Regret_T(LOMD^{R′}_X, ENEMY, X) ≤ ρ√(2θT/σ).

Proof. Note that LOMD^R_X = AdaDA^ℛ_X, where ℛ is given by ℛ(f) := [f = ⟨⟩]R for every f ∈ Seq((−∞,+∞]^E). Moreover, since R is a mirror map for X, ℛ is a mirror map strategy for C. Therefore, the first inequality is a direct application of Corollary 5.5.2 together with the fact that R is σ-strongly convex on X w.r.t. ‖·‖.

If each f ∈ F is ρ-Lipschitz continuous w.r.t. ‖·‖ on a convex set D such that X ⊆ int D, then by Theorem 3.8.4 we have that ∂f(x) ⊆ {g ∈ E : ‖g‖_* ≤ ρ} for each f ∈ F and x ∈ X. Using this in (5.20) and the fact that min_{x∈X} R(x) = R(x_1) yields

Regret_T(LOMD^R_X, ENEMY, u) ≤ R(u) − min_{x∈X} R(x) + Tρ²/(2σ).

Moreover, suppose there is θ ∈ R_{++} such that θ ≥ sup{R(x) − R(y) : x ∈ X, y ∈ X ∩ dom R}, and define

R′ := (ρ√T/√(2σθ)) R.

Note that R′ is (ρ√(σT)/√(2θ))-strongly convex on X. Suppose R′ is also a mirror map. Then, plugging R′ into the above inequality yields, for every u ∈ X,

Regret_T(LOMD^{R′}_X, ENEMY, u) ≤ (ρ√T/√(2σθ)) (R(u) − min_{x∈X} R(x)) + ρ√(θT)/√(2σ) ≤ ρ√(θT)/√(2σ) + ρ√(θT)/√(2σ) = ρ√(2θT/σ),

where in the second inequality we took the supremum over u ∈ X.
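As a closing illustration of the tuned bound (informal, and relying on two standard facts not proved in this section: the negative entropy R(x) = ∑_i x_i ln x_i is 1-strongly convex w.r.t. ‖·‖_1 on the simplex, and its range on the simplex has length ln n): take X = Δ_n the probability simplex, σ = 1, θ := ln n ≥ sup{R(x) − R(y) : x, y ∈ Δ_n ∩ dom R}, and assume each f ∈ F is ρ-Lipschitz continuous w.r.t. ‖·‖_1 on a neighborhood of Δ_n. Then Corollary 5.5.3 gives

\[
\mathrm{Regret}_T\bigl(\mathrm{LOMD}^{R'}_X, \mathrm{ENEMY}, X\bigr) \;\le\; \rho\sqrt{\frac{2\theta T}{\sigma}} \;=\; \rho\sqrt{2T\ln n},
\]

which is the familiar rate for entropic (exponentiated-gradient-style) dual averaging over the simplex.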