

Corollary 5.4.4 (Derived from Theorem 5.4.3). Let C := (X, F) be an OCO instance such that X is closed and such that each f ∈ F is a proper closed function which is subdifferentiable on X. Let R: E → (−∞,+∞] be a mirror map for X, let ENEMY be an enemy oracle for C, and let T ∈ N. Moreover, define

(x, f) := OCO_C(EOMD^R_X, ENEMY, T).

Finally, let g_t ∈ ∂f_t(x_t) be as in the definition of EOMD^R_X(f) in Algorithm 5.2 for each t ∈ [T]. If σ ∈ R_{++} is such that R is σ-strongly convex w.r.t. a norm ‖·‖ on E, then

Regret(EOMD^R_X, f, u) ≤ B_R(u, x_1) + (1/(2σ)) ∑_{t=1}^{T} ‖g_t‖_*²,   ∀u ∈ X,   (5.16)

where x_0 := x_1. In particular, if every function in F is ρ-Lipschitz continuous w.r.t. ‖·‖ on a convex set D ⊆ E such that X ⊆ int(D), if there is θ ∈ R_{++} such that θ ≥ sup{B_R(x, y) : x ∈ X, y ∈ X ∩ int(dom R)}, and if R′ := (ρ√T/√(2σθ)) R is also a mirror map for X, then

Regret_T(EOMD^{R′}_X, ENEMY, X) ≤ ρ√(2θT/σ).

Proof. Note that EOMD^R_X = AdaOMD^ℛ_X, where ℛ is given by ℛ(f) := [f = ⟨⟩]R for every f ∈ Seq((−∞,+∞]^E). Moreover, since R is a mirror map for X, ℛ is a mirror map strategy for C. Therefore, the first inequality is a direct application of Theorem 5.4.3 together with the fact that R is σ-strongly convex on X w.r.t. ‖·‖.

If each f ∈ F is ρ-Lipschitz continuous w.r.t. ‖·‖ on a convex set D such that X ⊆ int(D), then by Theorem 3.8.4 we have that ∂f(x) ⊆ {g ∈ E : ‖g‖_* ≤ ρ} for each f ∈ F and x ∈ X. Using this in (5.16) yields

Regret_T(EOMD^R_X, ENEMY, u) ≤ B_R(u, x_1) + Tρ²/(2σ).

Moreover, suppose there is θ ∈ R_{++} such that θ ≥ sup{B_R(x, y) : x ∈ X, y ∈ X ∩ int(dom R)}, and define

R′ := (ρ√T/√(2σθ)) R.

Note that R′ is (ρ√(σT)/√(2θ))-strongly convex on X. Finally, suppose R′ is also a mirror map.

Then, plugging R′ into the above inequality yields, for every u ∈ X,

Regret_T(EOMD^{R′}_X, ENEMY, u) ≤ (ρ√T/√(2σθ)) B_R(u, x_1) + ρ√(θT)/√(2σ) ≤ ρ√(θT)/√(2σ) + ρ√(θT)/√(2σ) = ρ√(2θT/σ),

where in the second inequality we took the supremum over u ∈ X.
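For concreteness, here is a quick informal instantiation of the tuned bound (the Euclidean facts below are standard but are not part of the corollary itself): take X to be the Euclidean ball of radius r and R := (1/2)‖·‖_2², so that σ = 1 and B_R(x, y) = (1/2)‖x − y‖_2² ≤ 2r² for all x, y ∈ X. Choosing θ := 2r² gives

\[
\mathrm{Regret}_T\bigl(\mathrm{EOMD}^{R'}_X, \mathrm{ENEMY}, X\bigr) \;\le\; \rho\sqrt{\frac{2\theta T}{\sigma}} \;=\; \rho\sqrt{4 r^2 T} \;=\; 2\rho r\sqrt{T}.
\]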

Dual Averaging or Lazy Online Mirror Descent

Recall from the last section that, at each round, the Adaptive OMD oracle takes a subgradient step in the dual space and then maps the resulting point y_{t−1} (as defined in Algorithm 5.1) back into the primal through the Bregman projection to get x_{t−1}. Thus, one may wonder what happens if we are lazy and make the subgradient step directly from y_t instead of computing the gradient of R_{t+1} at x_t and only then making a subgradient step from ∇R_{t+1}(x_t). Avoiding the computation of the gradient of the mirror map at every round (even though the algorithm still has to project the iterate from the dual to the primal space) may yield a drastic improvement in the time needed to compute each round in a practical implementation. This is exactly the idea of the Adaptive Dual Averaging (Adaptive DA or AdaDA) or Adaptive Lazy Online Mirror Descent algorithm. In Algorithm 5.3 we define the AdaDA oracle, which implements this algorithm. Moreover, in Figure 5.2 we present a schematic view of the computations done by AdaDA on round t+1 (one may find it useful to compare this figure with Figure 5.1). The name Dual Averaging comes originally from the static version of this algorithm for classic convex optimization [56] (though it is not originally presented in the same way as in Algorithm 5.3).
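To make the eager/lazy contrast concrete, here is a minimal Python sketch of a single round of each update for one particular choice of our own (not prescribed by the text): the Euclidean mirror map R(x) = ‖x‖²/(2η) over a Euclidean ball, with eta, radius and proj_ball as illustrative names.

```python
import numpy as np

# A minimal sketch (our own illustrative choice): one round of the "eager"
# AdaOMD-style update versus the "lazy" dual-averaging update, for the Euclidean
# mirror map R(x) = ||x||^2 / (2 * eta) over a Euclidean ball X.  For this R,
# grad R(x) = x / eta, grad R^*(y) = eta * y, and the Bregman projection onto the
# ball coincides with the Euclidean projection.

def proj_ball(x, radius=1.0):
    """Euclidean projection onto the ball of the given radius."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

def eager_omd_step(x_t, g_t, eta=0.1, radius=1.0):
    # Eager: evaluate grad R at the current iterate, step in the dual, map back, project.
    y = x_t / eta - g_t
    return proj_ball(eta * y, radius)

def lazy_da_step(y_t, g_t, eta=0.1, radius=1.0):
    # Lazy: update a running dual point directly; grad R is never evaluated at x_t.
    y_next = y_t - g_t
    x_next = proj_ball(eta * y_next, radius)
    return x_next, y_next
```

The point of the sketch is only the shape of the computations: the eager step needs ∇R at x_t every round, while the lazy step merely maintains a running dual point and projects it back to the primal.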

Algorithm 5.3 Definition of AdaDA^ℛ_X(⟨f_1, …, f_T⟩)
Input:
(i) A closed convex set X ⊆ E;
(ii) Convex functions f_1, …, f_T ∈ F for some T ∈ N and a set of convex functions F ⊆ (−∞,+∞]^E such that f_t is subdifferentiable on X for each t ∈ [T];
(iii) A mirror map strategy ℛ: Seq(F) → (−∞,+∞]^E for the OCO instance (X, F) which is differentiable on the open convex set D ⊆ E.
Output: x_{T+1} ∈ D ∩ X
  r_1 ← ℛ(⟨⟩)
  {x_1} ← argmin_{x ∈ X} r_1(x)
  y_1 ← 0
  for t = 1 to T do
    ▷ Computations for round t+1
    Define r_{t+1} := ℛ(⟨f_1, …, f_t⟩) and R_{t+1} := ∑_{i=1}^{t+1} r_i
    y_{t+1} := y_t − g_t, where g_t ∈ ∂f_t(x_t)
    x_{t+1} := Π^{R_{t+1}}_X(∇R_{t+1}^*(y_{t+1}))
  return x_{T+1}
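As a hedged illustration of the loop above, the following Python sketch instantiates Algorithm 5.3 with one concrete choice of mirror map strategy that is ours, not the text's: increments r_t = c_t·(1/2)‖·‖² over the unit Euclidean ball, for which ∇R_{t+1}^*(y) = y / (c_1 + … + c_{t+1}) and the Bregman projection is the Euclidean one.

```python
import numpy as np

# Hedged sketch of the AdaDA loop of Algorithm 5.3 for E = R^n, X = unit Euclidean
# ball, and mirror map increments r_t = c_t * (1/2)||.||^2, so that
# R_t = (c_1 + ... + c_t) * (1/2)||.||^2 and grad R_t^*(y) = y / (c_1 + ... + c_t).

def proj_ball(x, radius=1.0):
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

def ada_da(subgradient, T, n, increment):
    """subgradient(t, x) returns some g_t in the subdifferential of f_t at x;
    increment(t) returns the coefficient c_t of the t-th mirror map increment."""
    coeff = increment(1)             # R_1 = c_1 * (1/2)||.||^2
    x = np.zeros(n)                  # {x_1} = argmin of r_1 over the ball
    y = np.zeros(n)                  # y_1 = 0
    iterates = [x]
    for t in range(1, T + 1):
        g = subgradient(t, x)
        y = y - g                    # lazy subgradient step in the dual space
        coeff += increment(t + 1)    # R_{t+1} = (c_1 + ... + c_{t+1}) * (1/2)||.||^2
        x = proj_ball(y / coeff)     # x_{t+1} = Pi_X(grad R_{t+1}^*(y_{t+1}))
        iterates.append(x)
    return iterates

# Example usage with linear losses f_t(x) = <g_t, x> and AdaGrad-like increments:
rng = np.random.default_rng(0)
gs = rng.normal(size=(50, 3))
xs = ada_da(lambda t, x: gs[t - 1], T=50, n=3,
            increment=lambda t: np.sqrt(t) - np.sqrt(t - 1))
```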

As for the AdaOMD oracle, we pass the set X ⊆ E from which the player is supposed to pick her points as a parameter of AdaDA. One may note, however, that this is not necessary due to Proposition 5.4.1, which says that using a mirror map strategy plus the indicator function of X as a new mirror map renders the Bregman projection unnecessary. Still, we found that presenting the AdaDA oracle in the way most similar to the AdaOMD oracle is informative.
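As a concrete (and purely illustrative) instance of that remark, take X to be the probability simplex and the negative-entropy mirror map; folding δ(· | X) into the mirror map turns the dual-to-primal map into a softmax, whose output already lies in X, so no separate Bregman projection is needed:

```python
import numpy as np

# Purely illustrative check: with X the probability simplex and R the negative
# entropy, the conjugate of R + delta(.|X) is the log-sum-exp, whose gradient is the
# softmax.  The dual-to-primal map therefore already lands in X, so no separate
# Bregman projection is required.

def softmax(y):
    z = np.exp(y - np.max(y))        # gradient of the conjugate of negentropy + indicator
    return z / z.sum()

y = np.array([0.3, -1.2, 2.0])       # an arbitrary dual point (e.g., minus summed subgradients)
x = softmax(y)
assert np.all(x >= 0) and abs(x.sum() - 1.0) < 1e-12   # already feasible: x lies in the simplex
```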

Finally, one may recall from the last section that making, at each round, a subgradient step from the gradient of the mirror map computed at the previous iterate was one of the sources of complications in writing the Adaptive OMD as an application of the Adaptive FTRL oracle. Thus, we may hope that the AdaDA oracle has a much cleaner connection with AdaFTRL than AdaOMD does. The following theorem shows that this is indeed the case.

[Figure 5.2: Graphic representation of the computations done by AdaDA on round t+1 (primal and dual spaces).]

Theorem 5.5.1. Let C := (X, F) be an OCO instance such that X is closed and such that each f ∈ F is proper and closed. Let ℛ: Seq(F) → (−∞,+∞]^E be a mirror map strategy for C, let ENEMY be an enemy oracle for C, and let T ∈ N. Define

(x, f) := OCO_C(AdaDA^ℛ_X, ENEMY, T),   x_{T+1} := AdaDA^ℛ_X(f),   and   R_t := ∑_{i=1}^{t} ℛ(⟨f_1, …, f_{i−1}⟩) for each t ∈ {1, …, T+1}.

Moreover, let g_t ∈ ∂f_t(x_t) be the same as in the definition of AdaDA^ℛ_X(f) in Algorithm 5.3 for each t ∈ [T]. If R_t is strongly convex⁹ on X for each t ∈ {1, …, T+1}, then

{x_t} = argmin_{x ∈ X} ( ∑_{i=1}^{t−1} ⟨g_i, x⟩ + R_t(x) )   (5.17)

for each t ∈ {1, …, T+1}. Additionally, set F_g := {f ∈ F : g ∈ ∂f(x) for some x ∈ X} for each g ∈ E, and L := {⟨g, ·⟩ : g ∈ E s.t. F_g ≠ ∅}. Moreover, for every h ∈ L and for g_h := ∇h(0) (that is, h = ⟨g_h, ·⟩), set

f_h := { f_t if g_h = g_t for some t ∈ [T];  some f ∈ F_{g_h} otherwise },   ∀h ∈ L.

Finally, define

ℛ′(h) := ℛ(⟨f_{h_1}, f_{h_2}, …, f_{h_t}⟩) + δ(· | X)   ∀h ∈ L^t, ∀t ∈ N.

In this case, ℛ′ is a FTRL regularizer strategy for C′ := (X, L) and

AdaDA^ℛ_X(f_{1:t−1}) = AdaFTRL^{ℛ′}(⟨⟨g_1, ·⟩, …, ⟨g_{t−1}, ·⟩⟩)   ∀t ∈ {1, …, T+1}.

⁹We need this assumption in order to apply Proposition 5.4.1.

Proof. Let t ∈ [T] and let y_t ∈ E be as in the definition of AdaDA^ℛ_X(f) in Algorithm 5.3. Since y_1 = 0, by an easy induction one can see that y_t = −∑_{i=1}^{t−1} g_i. By the definition of AdaDA in Algorithm 5.3 we have that x_t = Π^{R_t}_X(∇R_t^*(y_t)). Since R_t is strongly convex on X, by Lemma 3.11.4 we have x_t = ∇P_t^*(y_t), where P_t := R_t + δ(· | X). Since R_t and δ(· | X) are closed (recall that X is closed), by Theorem 3.2.7 we know that P_t is closed. Therefore, by the properties of subgradients from Theorem 3.5.2 (namely items (ii) and (v)), and since {∇P_t^*(y_t)} = ∂P_t^*(y_t) by Theorem 3.5.5, we have

x_t = ∇P_t^*(y_t)  ⟺  {x_t} = argmax_{x ∈ E} (⟨y_t, x⟩ − P_t(x)) = argmin_{x ∈ X} ( ∑_{i=1}^{t−1} ⟨g_i, x⟩ + R_t(x) ).

In particular, define ℛ′ as in the statement of the theorem. Let us first show that

ℛ′ is a FTRL regularizer strategy for C′.   (5.18)

Let T₀ ∈ N, let h ∈ L^{T₀}, and set R₀ := ∑_{t=1}^{T₀+1} ℛ′(h_{1:t−1}). Since ℛ is a mirror map strategy, since X is closed, and since the sum of closed and convex functions is also closed and convex by Theorem 3.2.7, we clearly have that ℛ′(h) is a closed proper convex function, that is, ℛ′ satisfies condition (4.5.i) of a FTRL regularizer strategy for C′. Thus, we only need to show that R₀ is a classical FTRL regularizer for C′. With the same arguments we have just used, it is easy to see that R₀ is a proper closed convex function, that is, it satisfies property (4.4.i) of a FTRL regularizer for C′. Moreover, we clearly have dom R₀ ⊆ X, which is condition (4.4.ii) of a FTRL regularizer for C′. Let T″ ∈ N and h″ ∈ L^{T″}. Note that by assumption each mirror map increment of ℛ is strongly convex on X. Thus, R₀ is strongly convex on E, which implies that R₀ + ∑_{t=1}^{T″} h″_t is strongly convex on E and closed by Theorem 3.2.7, since it is the sum of closed and convex functions. Therefore, by Lemma 3.9.14 we know that inf_{x ∈ E} (R₀ + ∑_{t=1}^{T″} h″_t)(x) is attained, which finishes the proof of (5.18). Finally, note that for every t ∈ {1, …, T+1} we have

{x_t} = argmin_{x ∈ X} ( ∑_{i=1}^{t−1} ⟨g_i, x⟩ + R_t(x) )

      = argmin_{x ∈ E} ( ∑_{i=1}^{t−1} ⟨g_i, x⟩ + ( ∑_{i=1}^{t} ℛ(⟨f_1, …, f_{i−1}⟩) )(x) + δ(x | X) )

      = argmin_{x ∈ E} ( ∑_{i=1}^{t−1} ⟨g_i, x⟩ + ∑_{i=1}^{t} ℛ′(⟨⟨g_1, ·⟩, …, ⟨g_{i−1}, ·⟩⟩)(x) )

      = AdaFTRL^{ℛ′}(⟨⟨g_1, ·⟩, …, ⟨g_{t−1}, ·⟩⟩).
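A hedged numerical sanity check of this correspondence, for one concrete setup of our choosing (unit Euclidean ball and the static mirror map R = (1/2)‖·‖², so that R_t = R for all t): the AdaDA iterate is the projection of y_t = −(g_1 + … + g_{t−1}) onto the ball, and (5.17) says it should minimize the FTRL objective on the linearized losses over X.

```python
import numpy as np

# Hedged check: for X = unit Euclidean ball and static R = (1/2)||.||^2, the AdaDA
# iterate Pi_X(grad R^*(y_t)) = Pi_X(-(g_1 + ... + g_{t-1})) should minimize the
# FTRL objective <g_1 + ... + g_{t-1}, x> + R(x) over X.  Subgradients are placeholders.

def proj_ball(x, radius=1.0):
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

def ftrl_objective(x, G):
    return G @ x + 0.5 * (x @ x)

rng = np.random.default_rng(1)
gs = rng.normal(size=(20, 4))

for t in range(1, 21):
    G = gs[: t - 1].sum(axis=0)      # g_1 + ... + g_{t-1} (the zero vector when t = 1)
    x_da = proj_ball(-G)             # AdaDA: Bregman-project grad R^*(y_t) = y_t = -G
    # Compare against random feasible points: x_da should attain the smallest objective.
    for _ in range(200):
        z = proj_ball(rng.normal(size=4)) * rng.uniform()
        assert ftrl_objective(x_da, G) <= ftrl_objective(z, G) + 1e-9
```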

The above theorem tells us something very interesting: Adaptive Dual Averaging with mirror map strategy ℛ is closely related (actually, almost equivalent) to the Adaptive FTRL algorithm whose regularizer strategy is ℛ with δ(· | X) added to each mirror map increment, applied to the linearized versions of the functions played by the enemy. The name Dual Averaging stems exactly from the equality between AdaDA and AdaFTRL given by the above theorem. Indeed, in the application of AdaFTRL in the above theorem we are minimizing over the set X the linear function given by the average¹⁰ of the subgradients of the past functions plus a regularizer function.
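To make the averaging in the name explicit: for t ≥ 2, dividing the objective in (5.17) by the positive constant t − 1 does not change its minimizer, so the same iterate can be written as

\[
\{x_t\} \;=\; \operatorname*{arg\,min}_{x \in X}\;\Bigl(\Bigl\langle \tfrac{1}{t-1}\sum_{i=1}^{t-1} g_i,\; x\Bigr\rangle \;+\; \tfrac{1}{t-1}\, R_t(x)\Bigr),
\]

that is, the minimizer over X of the average of the past subgradients (as a linear function) plus a rescaled regularizer.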

This simplification done by the Adaptive Dual Averaging algorithm when compared to the Adaptive Online Mirror Descent does not come without its costs. Note that, by the last theorem, AdaDA works like a general FTRL algorithm, while AdaOMD works as a proximal FTRL algorithm. As discussed in Section 4.7, this may influence the efficiency or the amount of previous information needed by the oracle in some cases. We will look more carefully at some of these cases in Chapter 6.

¹⁰Even though we are looking at the sum of the subgradients in the formula, recall that we can scale the regularizer to effectively normalize this sum.

Given such a clean connection between the Adaptive Dual Averaging algorithm and the Adaptive FTRL algorithm, it is no surprise that regret bounds for the AdaFTRL oracle directly imply regret bounds for AdaDA, as we show in the next corollary.

Corollary 5.5.2 (Derived from Theorems 4.4.3 and 5.5.1). Let C := (X, F) be an OCO instance such that X is closed and such that each f ∈ F is a proper closed function which is subdifferentiable on X. Let ℛ: Seq(F) → (−∞,+∞]^E be a mirror map strategy for C, let ENEMY be an enemy oracle for C, and let T ∈ N. Moreover, define

(x, f) := OCO_C(AdaDA^ℛ_X, ENEMY, T),
r_t := ℛ(⟨f_1, …, f_{t−1}⟩) for each t ∈ [T], and
R_t := ∑_{i=1}^{t} r_i for each t ∈ [T].

Finally, let g_t ∈ ∂f_t(x_t) be as in the definition of AdaDA^ℛ_X(⟨f_1, …, f_T⟩) in Algorithm 5.3 for each t ∈ [T], and suppose that for each t ∈ [T] there are σ_t ∈ R_{++} and a norm ‖·‖_{(t)} on E such that R_t is σ_t-strongly convex w.r.t. ‖·‖_{(t)} on E. Then, for every u ∈ X,

Regret(AdaDA^ℛ_X, f, u) ≤ ∑_{t=1}^{T} (r_t(u) − r_t(x_t)) + (1/2) ∑_{t=1}^{T} (1/σ_t) ‖g_t‖_{(t),∗}².

Proof. Define h_t := ⟨g_t, ·⟩ for each t ∈ [T], set L := {⟨g, ·⟩ : f ∈ F, x ∈ X, g ∈ ∂f(x)}, and define the OCO instance C′ := (X, L). By Theorem 5.5.1, we know that there is a FTRL regularizer strategy ℛ′ for C′ such that x_t = AdaFTRL^{ℛ′}(⟨h_1, …, h_{t−1}⟩) for every t ∈ [T]. Therefore, by the subgradient inequality, for every u ∈ X we have

Regret(AdaDA^ℛ_X, f, u) = ∑_{t=1}^{T} (f_t(x_t) − f_t(u)) ≤ ∑_{t=1}^{T} ⟨g_t, x_t − u⟩ = Regret(AdaFTRL^{ℛ′}, h, u).   (5.19)

Moreover, by the definition of ℛ′ (see Theorem 5.5.1), we have

∑_{i=1}^{t} ℛ′(h_{1:i−1}) = ∑_{i=1}^{t} ℛ(f_{1:i−1}) + δ(· | X) = R_t + δ(· | X),   ∀t ∈ [T].

Since R_t is σ_t-strongly convex w.r.t. the norm ‖·‖_{(t)} on E for every t ∈ [T], we have that ℛ′ is σ-strong¹¹ for h w.r.t. ‖·‖_{(1)}, …, ‖·‖_{(T)}, where σ := ⟨σ_1, …, σ_T⟩. Finally, since ∇h_t(x_t) = g_t for each t ∈ [T], by the general AdaFTRL regret bound from Theorem 4.4.3 we have, for every u ∈ X,

Regret(AdaFTRL^{ℛ′}, h, u) ≤ ∑_{t=1}^{T} (r_t(u) − r_t(x_t)) + (1/2) ∑_{t=1}^{T} (1/σ_t) ‖g_t‖_{(t),∗}².

¹¹Note that the condition on the relative interior of the regularizer and the functions in L is trivially satisfied since all functions in L are finite everywhere.
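As a hedged end-to-end sketch of this bound (all concrete choices below are ours, not the text's): run AdaDA on linear losses over the unit Euclidean ball with increments r_t = (√t − √(t−1))·(1/2)‖·‖², so that R_t = √t·(1/2)‖·‖² is σ_t = √t-strongly convex w.r.t. the self-dual norm ‖·‖_2, and evaluate both sides of the inequality above.

```python
import numpy as np

# Hedged sketch: AdaDA on linear losses f_t(x) = <g_t, x> over the unit Euclidean
# ball with increments r_t = (sqrt(t) - sqrt(t-1)) * (1/2)||.||^2.  We compute the
# realized regret against a comparator u and the right-hand side of the bound;
# the corollary guarantees regret <= rhs for this instance.

def proj_ball(x, radius=1.0):
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

def c(t):
    # increment coefficient c_t, chosen so that c_1 + ... + c_t = sqrt(t)
    return np.sqrt(t) - np.sqrt(t - 1)

rng = np.random.default_rng(2)
T, dim = 200, 3
gs = rng.normal(size=(T, dim))                  # placeholder subgradients g_1, ..., g_T

xs, y = [np.zeros(dim)], np.zeros(dim)          # x_1 = argmin of r_1 over the ball, y_1 = 0
for t in range(1, T + 1):
    y = y - gs[t - 1]                           # lazy dual step: y_{t+1} = y_t - g_t
    xs.append(proj_ball(y / np.sqrt(t + 1)))    # x_{t+1} = Pi_X(grad R_{t+1}^*(y_{t+1}))

u = -gs.sum(axis=0)
u = u / np.linalg.norm(u)                       # some comparator point u in X
regret = sum(gs[t - 1] @ (xs[t - 1] - u) for t in range(1, T + 1))

reg_term = sum(c(t) * 0.5 * (u @ u - xs[t - 1] @ xs[t - 1]) for t in range(1, T + 1))
stab_term = 0.5 * sum((gs[t - 1] @ gs[t - 1]) / np.sqrt(t) for t in range(1, T + 1))
print(regret, "<=", reg_term + stab_term)       # inequality guaranteed by Corollary 5.5.2
```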

In a way similar to what we have done for the AdaFTRL and AdaOMD algorithms, let us look at a version of AdaDA with a static regularizer, which we call (classical) Lazy Online Mirror Descent. We define an oracle which implements this algorithm in Algorithm 5.4.

Algorithm 5.4 Definition of LOMD^R_X(⟨f_1, …, f_T⟩)
Input:
(i) A closed convex set X ⊆ E;
(ii) Convex functions f_1, …, f_T ∈ F for some T ∈ N and F ⊆ (−∞,+∞]^E such that f_i is subdifferentiable on X for each i ∈ [T];
(iii) A mirror map R: E → (−∞,+∞] for (X, F).
Output: x_{T+1} ∈ int(dom R) ∩ X
  {x_1} ← argmin_{x ∈ X} R(x)
  y_1 ← 0
  for t = 1 to T do
    ▷ Computations for round t+1
    Compute g_t ∈ ∂f_t(x_t)
    y_{t+1} ← y_t − g_t
    x_{t+1} ← Π^R_X(∇R^*(y_{t+1}))
  return x_{T+1}
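For one concrete static mirror map, R(x) = ‖x‖²/(2η) (our illustrative choice), ∇R^*(y) = ηy and the Bregman projection is the Euclidean one, so Algorithm 5.4 reduces to "lazy projected subgradient descent". A minimal Python sketch, with proj_X, eta and the loss subgradients as placeholders:

```python
import numpy as np

# Minimal sketch of Algorithm 5.4 for the static mirror map R(x) = ||x||^2 / (2 * eta):
# grad R^*(y) = eta * y and the Bregman projection coincides with the Euclidean one.

def lomd(subgradient, proj_X, T, dim, eta=0.1):
    x = proj_X(np.zeros(dim))        # {x_1} = argmin of R over X (minimum-norm point of X)
    y = np.zeros(dim)                # y_1 = 0
    iterates = [x]
    for t in range(1, T + 1):
        g = subgradient(t, x)        # some g_t in the subdifferential of f_t at x_t
        y = y - g                    # lazy dual step, never touching grad R(x_t)
        x = proj_X(eta * y)          # x_{t+1} = Pi_X(grad R^*(y_{t+1}))
        iterates.append(x)
    return iterates
```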

Corollary 5.5.3 (Derived from Corollary 5.5.2). Let C := (X, F) be an OCO instance such that X is closed and such that each f ∈ F is a proper closed function which is subdifferentiable on X. Let R: E → (−∞,+∞] be a mirror map for X, let ENEMY be an enemy oracle for C, and let T ∈ N. Moreover, define

(x, f) := OCO_C(LOMD^R_X, ENEMY, T).

Finally, let g_t ∈ ∂f_t(x_t) be as in the definition of LOMD^R_X(f) in Algorithm 5.4 for each t ∈ [T], and suppose there are σ ∈ R_{++} and a norm ‖·‖ on E such that R is σ-strongly convex w.r.t. ‖·‖ on X. Then,

Regret(LOMD^R_X, f, u) ≤ R(u) − R(x_1) + (1/(2σ)) ∑_{t=1}^{T} ‖g_t‖_*²,   ∀u ∈ X.   (5.20)

In particular, if every function in F is ρ-Lipschitz continuous w.r.t. ‖·‖ on a convex set D ⊆ E such that X ⊆ int D, if there is θ ∈ R_{++} such that θ ≥ sup{R(x) − R(y) : x ∈ X, y ∈ X ∩ dom R}, and if R′ := (ρ√T/√(2σθ)) R is also a mirror map for X, then

Regret_T(LOMD^{R′}_X, ENEMY, X) ≤ ρ√(2θT/σ).

Proof. Note that LOMD^R_X = AdaDA^ℛ_X, where ℛ is given by ℛ(f) := [f = ⟨⟩]R for every f ∈ Seq((−∞,+∞]^E). Moreover, since R is a mirror map for X, ℛ is a mirror map strategy for C. Therefore, the first inequality is a direct application of Corollary 5.5.2 together with the fact that R is σ-strongly convex on X w.r.t. ‖·‖.

If each f ∈ F is ρ-Lipschitz continuous w.r.t. ‖·‖ on a convex set D such that X ⊆ int D, then by Theorem 3.8.4 we have that ∂f(x) ⊆ {g ∈ E : ‖g‖_* ≤ ρ} for each f ∈ F and x ∈ X. Using this in (5.20) and the fact that min_{x∈X} R(x) = R(x_1) yields

Regret_T(LOMD^R_X, ENEMY, u) ≤ R(u) − min_{x∈X} R(x) + Tρ²/(2σ).

Moreover, suppose there is θ ∈ R_{++} such that θ ≥ sup{R(x) − R(y) : x ∈ X, y ∈ X ∩ dom R}, and define

R′ := (ρ√T/√(2σθ)) R.

Note that R′ is (ρ√(σT)/√(2θ))-strongly convex on X. Suppose R′ is also a mirror map. Then, plugging R′ into the above inequality yields, for every u ∈ X,

Regret_T(LOMD^{R′}_X, ENEMY, u) ≤ (ρ√T/√(2σθ)) (R(u) − min_{x∈X} R(x)) + ρ√(θT)/√(2σ) ≤ ρ√(θT)/√(2σ) + ρ√(θT)/√(2σ) = ρ√(2θT/σ),

where in the second inequality we took the supremum over u ∈ X.
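As a closing illustration of the tuned bound (informal, and relying on two standard facts not proved in this section: the negative entropy R(x) = ∑_i x_i ln x_i is 1-strongly convex w.r.t. ‖·‖_1 on the simplex, and its range on the simplex has length ln n): take X = Δ_n the probability simplex, σ = 1, θ := ln n ≥ sup{R(x) − R(y) : x, y ∈ Δ_n ∩ dom R}, and assume each f ∈ F is ρ-Lipschitz continuous w.r.t. ‖·‖_1 on a neighborhood of Δ_n. Then Corollary 5.5.3 gives

\[
\mathrm{Regret}_T\bigl(\mathrm{LOMD}^{R'}_X, \mathrm{ENEMY}, X\bigr) \;\le\; \rho\sqrt{\frac{2\theta T}{\sigma}} \;=\; \rho\sqrt{2T\ln n},
\]

which is the familiar rate for entropic (exponentiated-gradient-style) dual averaging over the simplex.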