
Note that R′ is (ρ√(σT)/√(2θ))-strongly convex on X. Suppose R′ is also a mirror map. Then, plugging R′ into the above inequality yields, for every u ∈ X,

\[
\mathrm{Regret}_T(\mathrm{LOMD}_{R'}, \mathrm{ENEMY}, u)
\le \frac{\rho\sqrt{T}}{\sqrt{2\sigma\theta}}\Big(R(u) - \min_{x\in X} R(x)\Big) + \rho\sqrt{\frac{\theta T}{2\sigma}}
\le \rho\sqrt{\frac{\theta T}{2\sigma}} + \rho\sqrt{\frac{\theta T}{2\sigma}}
= \rho\sqrt{\frac{2\theta T}{\sigma}},
\]

where in the second inequality we took the supremum over u ∈ X.

When Lazy and Eager OMD are Equivalent

By property (5.1.ii) of the definition of a mirror map, we know that the Bregman projections onto X w.r.t. R lie in int(dom R) ∩ X.

Therefore, if we have int(dom R) ∩ X ⊆ int X, then equivalence between EOMD and LOMD holds for C!

In fact, requiring the iterates to lie in the interior of X is excessively strong. For example, the spectraplex is not full-dimensional, since it lives inside a hyperplane w.r.t. the trace inner product, and thus it has empty interior.

Thus, the above argument does not cover the cases considered in [3]. Still, a very similar argument holds if we require the iterates to lie in ri(X). Indeed, the next theorem proves the equivalence of the eager and lazy versions of OMD under this assumption. One slightly technical issue arises in the proof: we need to ensure that both algorithms pick the same subgradients whenever they have to choose one, that is, they have to pick a subgradient according to the same well-order on the subdifferentials. Usually the algorithms access the subgradients of a convex function f: E → (−∞, +∞] through a function of the form x ∈ X ↦ g ∈ ∂f(x), so assuming that both algorithms pick the same subgradient given the same function f and the same point x is not a strong assumption.
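To make this concrete, here is a minimal sketch (not from the text) of such a deterministic subgradient oracle for f(x) = ‖x‖₁: wherever the subdifferential contains more than one element, the oracle applies a fixed tie-breaking rule, so any two algorithms querying it at the same point receive the same subgradient.

```python
import numpy as np

def subgrad_l1(x):
    """Deterministic subgradient oracle for f(x) = ||x||_1.

    At a coordinate with x_i = 0 the subdifferential of |.| is the interval
    [-1, 1]; this oracle always returns 0 there (np.sign(0) == 0), a fixed
    tie-breaking rule.  Two algorithms calling this oracle at the same point
    therefore use the same g in the subdifferential of f at x.
    """
    return np.sign(x)

x = np.array([1.5, 0.0, -2.0])
g = subgrad_l1(x)   # array([ 1.,  0., -1.]), an element of the subdifferential
```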

Theorem 5.6.1. Let C := (X, F) be an OCO instance such that X is a nonempty closed set and each f ∈ F is a proper closed function which is subdifferentiable on X. Let R: E → (−∞, +∞] be a mirror map for X. If int(dom R) ∩ X ⊆ ri(X) and both EOMD^R_X and LOMD^R_X use the same well-order on the sets ∂f(x) for f ∈ F and x ∈ X, then EOMD^R_X = LOMD^R_X.

Proof. Suppose int(dom R) ∩ X ⊆ ri(X). Let us prove, by induction on T ∈ N, that EOMD^R_X(f) = LOMD^R_X(f) for every f ∈ F^T.

For T = 0, we have

\[
\{\mathrm{EOMD}^R_X(\langle\rangle)\} = \operatorname*{arg\,min}_{x \in X} R(x) = \{\mathrm{LOMD}^R_X(\langle\rangle)\}.
\]

Let T ∈ N \ {0} and f ∈ F^T. By the induction hypothesis, we have

\[
x_t := \mathrm{EOMD}^R_X(f_{1:t-1}) = \mathrm{LOMD}^R_X(f_{1:t-1}), \qquad \forall t \in [T].
\]

Let us show that EOMD^R_X(f) = LOMD^R_X(f). Let g_t ∈ ∂f_t(x_t) be as in the definition of EOMD^R_X(f) for each t ∈ [T] (these are the same as the g_t in the definition of LOMD^R_X(f), since both oracles use the same well-order on the subdifferentials to pick the subgradients). By Theorem 5.4.2, there are p_1, …, p_T ∈ E with p_t ∈ N_X(x_t) for each t ∈ [T] such that

\[
\{\mathrm{EOMD}^R_X(f)\} = \operatorname*{arg\,min}_{x \in X}\Big(\sum_{t=1}^{T} \langle g_t, x\rangle + R(x) + \sum_{t=1}^{T} \langle p_t, x\rangle\Big).
\]

By (5.1.ii) from the mirror map definition, Π^R_X(z) ∈ int(dom R) ∩ X ⊆ ri(X) for any z ∈ int(dom R). In particular, x_t = EOMD^R_X(f_{1:t−1}) ∈ ri(X) for every t ∈ [T]. Let us show that the terms ⟨p_t, x⟩ for t ∈ [T] do not affect the point that attains the above minimum. More specifically, let us show that

\[
\langle p_t, x\rangle = \langle p_t, x_t\rangle, \qquad \forall x \in \mathrm{ri}(X),\ \forall t \in [T]. \tag{5.21}
\]

Let t ∈ [T]. Since p_t ∈ N_X(x_t), we have ⟨p_t, x − x_t⟩ ≤ 0 for every x ∈ X. Suppose, for the sake of contradiction, that there is x̄ ∈ X such that ⟨p_t, x̄ − x_t⟩ < 0. Since x_t ∈ ri(X), by Theorem 3.2.2 there is µ > 1 such that x_µ := µx_t + (1 − µ)x̄ ∈ X. Since 1 − µ < 0,

\[
\langle p_t, x_\mu - x_t\rangle = (1-\mu)\langle p_t, \bar{x} - x_t\rangle > 0,
\]

a contradiction to the fact that p_t ∈ N_X(x_t). This proves (5.21). Thus, using that the subgradients g_t ∈ ∂f_t(x_t) are the same in the definitions of EOMD^R_X(f) and LOMD^R_X(f) for each t ∈ [T], and by Theorem 5.5.1, we have

\[
\begin{aligned}
\{\mathrm{EOMD}^R_X(f)\}
&= \operatorname*{arg\,min}_{x \in X}\Big(\sum_{t=1}^{T} \langle g_t, x\rangle + R(x) + \sum_{t=1}^{T} \langle p_t, x\rangle\Big)\\
&\overset{(5.21)}{=} \operatorname*{arg\,min}_{x \in X}\Big(\sum_{t=1}^{T} \langle g_t, x\rangle + R(x)\Big)
\overset{\text{Thm.\ 5.5.1}}{=} \{\mathrm{LOMD}^R_X(f)\}.
\end{aligned}
\]
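As a sanity check of this equivalence, here is a numerical sketch (not part of the text) of the prototypical case covered by the theorem: the negative-entropy mirror map R(x) = Σ_i x_i ln x_i over the simplex, for which int(dom R) ∩ X is exactly the relative interior of the simplex. With a common fixed step size η (an extra assumption of this sketch), the eager and lazy iterates coincide exactly.

```python
import numpy as np

def eager_entropic_omd(grads, eta, d):
    """Eager OMD with the negative-entropy mirror map on the simplex."""
    x = np.full(d, 1.0 / d)            # x_1 = argmin of R over the simplex
    iterates = [x]
    for g in grads:
        z = x * np.exp(-eta * g)       # grad R*(grad R(x) - eta * g)
        x = z / z.sum()                # Bregman (KL) projection onto the simplex
        iterates.append(x)
    return iterates

def lazy_entropic_omd(grads, eta, d):
    """Lazy OMD: accumulate steps in the dual space, project only to play."""
    x1 = np.full(d, 1.0 / d)
    y = np.log(x1) + 1.0               # y_1 = grad R(x_1), with grad R(x) = 1 + ln x
    iterates = [x1]
    for g in grads:
        y = y - eta * g
        z = np.exp(y - 1.0)            # grad R*(y)
        iterates.append(z / z.sum())   # KL projection onto the simplex
    return iterates

rng = np.random.default_rng(0)
grads = [rng.normal(size=5) for _ in range(50)]
eager = eager_entropic_omd(grads, eta=0.1, d=5)
lazy = lazy_entropic_omd(grads, eta=0.1, d=5)
assert all(np.allclose(a, b) for a, b in zip(eager, lazy))
```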

Chapter 6

Adaptive Regularization

Up to this point, all the algorithms we have seen for OCO problems required us to make a smart choice of regularizer/mirror map strategy, based on the parameters of the instance at hand, to guarantee good regret bounds. For example, most algorithms seen in Chapters 4 and 5 required previous knowledge of the Lipschitz constant of the functions played by the enemy to properly scale the regularizer and, in this way, guarantee low regret bounds. Not only that, the strategies seen so far used little to no information from the functions previously played by the enemy, usually looking at most at the round number of the game to compute the new regularizer/mirror map increment.

Moreover, the regret bounds from previous chapters depend on the Lipschitz constant of the functions played by the enemy, since by Theorem 3.8.4 the Lipschitz constant usually upper bounds the dual norms of the subgradients. The problem is that this only reflects a worst-case scenario, not giving much information in the case of enemies who play functions with subgradients whose dual norm is small. Thus, one may wonder if it might be possible to derive regret bounds for some OCO algorithms which still depend on the dual norms of the subgradients (in an informative way) instead of using a crude upper bound on such norms. Intuitively, when playing against “easy” enemies, the player oracle should perform better, and we should be able to have better regret guarantees in such cases.

In this chapter we describe algorithms which use information from the subgradients of the functions played by the enemy to better choose their regularizer increments during the game. In order to derive regret bounds, we show that these algorithms are special cases of the AdaReg algorithm, first described in [33], which is a clever application of the Adaptive Online Mirror Descent algorithm from Chapter 5. This enables us to make a unified analysis, first presented in [33], of two known and similar algorithms from the OCO literature: the AdaGrad [31] and Online Newton Step [37] algorithms. As a warm-up, in Section 6.1 we describe an online mirror descent algorithm (and its lazy version) with step sizes which adapt based on the subgradients of the enemy's functions. In Section 6.2 we present the AdaReg algorithm, discuss its main ideas, and derive a general regret bound by writing it as an AdaOMD algorithm with a smart choice of mirror map strategy. In Section 6.3 we describe the AdaGrad algorithm and derive a regret bound for it from the regret bound we have for AdaReg. In Section 6.4 we show a more efficient version of AdaGrad which only uses diagonal instead of general matrices to skew the subgradient steps. In Section 6.5 we define exp-concave functions and present the Online Newton Step algorithm, which has a logarithmic regret bound against exp-concave functions. Again, we derive regret bounds for the Online Newton Step algorithm by writing it as an application of the AdaReg algorithm. Finally, in Section 6.6 we show a step size strategy for online gradient descent which attains logarithmic regret against strongly convex functions. Additionally, we show how it can be seen as a "scalar version" of the Online Newton Step algorithm.

In this chapter we will extensively use norms induced by positive definite matrices, so let us define some notation for them. Let A ∈ S^d_{++}. Abusing the notation of Bregman projection, for every closed convex set X ⊆ R^d define

\[
\{\Pi^A_X(z)\} := \operatorname*{arg\,min}_{x \in X} \|x - z\|_A, \qquad \forall z \in \mathbb{R}^d.
\]

If R := ½‖·‖²_A, then one may verify that B_R(x, y) = ½‖x − y‖²_A. Thus, in this case, for any closed convex set X ⊆ R^d we have Π^R_X = Π^A_X. This equality of Bregman projections may be used without reference throughout this chapter. Moreover, throughout this chapter we consider S^d to be equipped with the trace inner product, given by ⟨X, Y⟩ := Tr(XY) for every X, Y ∈ S^d.
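As a quick numerical sanity check (a sketch of ours; the matrix A is randomly generated), one can verify the identity B_R(x, y) = ½‖x − y‖²_A claimed above for R = ½‖·‖²_A.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
M = rng.normal(size=(d, d))
A = M @ M.T + d * np.eye(d)            # a positive definite matrix in S^d_++

def R(x):
    return 0.5 * x @ A @ x             # R(x) = (1/2) ||x||_A^2

def bregman_R(x, y):
    # B_R(x, y) = R(x) - R(y) - <grad R(y), x - y>, with grad R(y) = A y
    return R(x) - R(y) - (A @ y) @ (x - y)

x, y = rng.normal(size=d), rng.normal(size=d)
assert np.isclose(bregman_R(x, y), 0.5 * (x - y) @ A @ (x - y))
```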

6.1 A First Example: Adaptive Online Gradient Descent

In Chapter 4, we have seen different FTRL regularizer strategies for some classes of OCO instances. Among all the regularizer strategies seen in that chapter, the regret bounds when using them in AdaFTRL depended mainly on whether the regularizer strategy was proximal or not, that is, whether the regularizer increment at round t + 1 was minimized by the iterate from round t (if t > 1) or not. In spite of the discussion about the differences in the regret bounds we have for each of these regularizer strategy classes (see Theorems 4.4.3 and 4.4.4 and the discussion which follows them), in all the cases studied in Chapter 4, proximal regularizer strategies did not yield any significant differences in the final regret bounds when compared to the ones given by non-proximal strategies. Thus, even though the regret bounds given by Theorems 4.4.3 and 4.4.4 hint at the possibility of proximal regularizer strategies being able to adapt better to the functions picked by the enemy, we have not yet exploited this difference in this text.

Another aspect which motivates the investigation of the advantages of proximal regularizer strategies is the connection between Online Mirror Descent algorithms and Adaptive FTRL algorithms seen in Chapter 5. Namely, let 𝓡 be a mirror map strategy for some OCO instance C := (X, F). In Section 5.4 we have seen that AdaOMD^𝓡_X is equal to AdaFTRL_𝓡′ when applied to linear functions, where 𝓡′ is a proximal regularizer strategy for C mostly based on Bregman divergences w.r.t. the regularizer increments given by 𝓡. Additionally, in Section 5.5 we have seen that AdaDA^𝓡_X is equivalent to AdaFTRL_𝓡″ when applied to linear functions, where 𝓡″ is equal to 𝓡 everywhere except on the empty sequence, where 𝓡″(⟨⟩) := 𝓡(⟨⟩) + δ(· | X). Thus, investigating when and how proximal FTRL regularizer strategies can be advantageous may shed light on the key differences between AdaOMD and AdaDA.

Our main goal in this chapter is to devise mirror map strategies for AdaOMD which take advantage of the adaptiveness present in its regret bound, which is inherited from the regret bound for proximal FTRL regularizer strategies from Theorem 4.4.4. One may build proximal FTRL regularizer strategies, similar to the mirror maps seen in this chapter, which yield similar regret bounds. Still, in this chapter our focus is on the Adaptive Online Mirror Descent algorithm, since looking at adaptive regularizers for AdaOMD yields a unified analysis of two major OCO algorithms: AdaGrad [31] and Online Newton Step [37].

As a warm-up, let us devise versions of the Online Mirror Descent algorithm (with a static mirror map) with time-varying step sizes based on the subgradients of the enemy's functions. Our goal is to devise an algorithm with low-regret guarantees regardless of the number of rounds and without knowledge of a bound on the norms of the subgradients. Not only that, we want the regret bound to be better for an enemy which picks "easy" functions, that is, functions whose subgradients have small (dual) norm. For example, let C := (X, F) be an OCO instance such that each f ∈ F is subdifferentiable on X ⊆ R^d with¹ ‖g‖₂ ≤ ρ for every x ∈ X and g ∈ ∂f(x), and suppose there is θ ∈ R_{++} such that θ ≥ sup_{x,y∈X} ½‖x − y‖₂². In Section 4.7 we have seen how to build a proximal FTRL regularizer strategy which yields a regret bound that holds for any number of rounds. Roughly, Corollary 4.7.3 tells us that AdaFTRL with (cumulative) regularizer R_t at round t ∈ N with t > 1 given by

\[
R_t(x) := \sqrt{\frac{\theta}{(t-1)\rho^2}}\;\frac{1}{2}\|x - x_{t-1}\|_2^2, \qquad \forall x \in \mathbb{R}^d,
\]

where x_{t−1} is the point chosen by the player at round t − 1, yields regret smaller than 2ρ√(θT) for a game with T ∈ N rounds. Let g_1, …, g_T ∈ R^d be such that, for every t ∈ [T], we have g_t ∈ ∂f_t(x_t), where f_t ∈ F is the function picked by the enemy on round t. Observe that Σ_{j=1}^{t−1} ‖g_j‖₂² ≤ (t − 1)ρ². That is, the denominator of the multiplicative factor of the regularizer at round t is an upper bound on the sum of the squared norms of the subgradients of the functions played by the enemy so far (i.e., on rounds 1 up to t − 1). The idea to make this regularizer strategy adaptive is, instead of using such an upper bound in the multiplicative factor of the regularizer, to use the actual sum of the squared norms of the subgradients, without needing knowledge of the constant ρ in advance.

Interestingly, this implies that, when the enemy plays functions whose subgradients have small norm, the step sizes² will be larger, that is, the algorithm will be more aggressive. In the following theorem we describe a mirror map strategy built on this idea. Later we will see why the same idea need not work as well for non-proximal regularizer strategies.
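The following sketch (ours; the feasible set X is taken to be the Euclidean unit ball purely for illustration) implements this adaptive step-size idea for online gradient descent: instead of the worst-case factor (t − 1)ρ², the step size at each round is computed from the running sum of squared subgradient norms, η = √(θ / (2 Σ_j ‖g_j‖₂²)), and no bound ρ needs to be known in advance.

```python
import numpy as np

def project_unit_ball(z):
    """Euclidean projection onto X = {x : ||x||_2 <= 1} (illustrative choice of X)."""
    n = np.linalg.norm(z)
    return z if n <= 1.0 else z / n

def adaptive_ogd(subgradient, T, d, theta):
    """Online gradient descent with step sizes adapted to past subgradients.

    subgradient(t, x) returns some g in the subdifferential at x of the
    function the enemy plays at round t; theta should upper bound
    sup_{x,y in X} (1/2)||x - y||_2^2 (theta = 2 works for the unit ball).
    """
    x = np.zeros(d)                      # x_1 = argmin_X (1/2)||x||_2^2
    sq_sum = 0.0                         # running sum of ||g_j||_2^2
    iterates = [x]
    for t in range(1, T + 1):
        g = subgradient(t, x)
        sq_sum += float(g @ g)
        # the step size for the next iterate uses all subgradients seen so far
        eta = 1.0 if sq_sum == 0.0 else np.sqrt(theta / (2.0 * sq_sum))
        x = project_unit_ball(x - eta * g)
        iterates.append(x)
    return iterates

# usage with an enemy playing linear functions f_t(x) = <c_t, x>
rng = np.random.default_rng(0)
costs = rng.normal(size=(100, 3)) * 0.05
xs = adaptive_ogd(lambda t, x: costs[t - 1], T=100, d=3, theta=2.0)
```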

Theorem 6.1.1. Let C := (X, F) be an OCO instance such that X is closed and each f ∈ F is a proper closed function which is subdifferentiable on X, and let R: E → (−∞, +∞] be a mirror map for X which is σ-strongly convex on X w.r.t. a norm ‖·‖ on E. Moreover, suppose there is θ ∈ R_{++} such that θ ≥ sup{B_R(u, x) : u ∈ X, x ∈ X ∩ int(dom R)}, and define

\[
\eta(g) :=
\begin{cases}
\sqrt{\dfrac{\theta}{2\sum_{j=1}^{t}\|g_j\|_*^2}} & \text{if } g \notin \{0, \langle\rangle\},\\[2ex]
1 & \text{otherwise,}
\end{cases}
\qquad \text{for every } g = \langle g_1, \dots, g_t\rangle \in \mathrm{Seq}(\mathbb{E}),
\]

and

\[
\mathcal{R}(f) := \Big(\frac{1}{\eta(g_{1:t})} - [t > 0]\,\frac{1}{\eta(g_{1:t-1})}\Big)\frac{1}{\sqrt{\sigma}}\, R, \qquad \text{for every } f = \langle f_1, \dots, f_t\rangle \in \mathrm{Seq}(\mathcal{F}),
\]

where g_i ∈ E is as in AdaOMD^𝓡_X(f) for i ∈ [t]. Let ENEMY be an enemy oracle for C, let T ∈ N, and define

\[
(\boldsymbol{x}, \boldsymbol{f}) := \mathrm{OCO}_{\mathcal{C}}(\mathrm{AdaOMD}^{\mathcal{R}}_X, \mathrm{ENEMY}, T).
\]

Finally, let g_t ∈ ∂f_t(x_t) be the same as in the definition of AdaOMD^𝓡_X(f) in Algorithm 5.1 for each t ∈ [T]. Then

\[
\mathrm{Regret}(\mathrm{AdaOMD}^{\mathcal{R}}_X, \boldsymbol{f}, X) \le \sqrt{\frac{2\theta}{\sigma} \sum_{t=1}^{T} \|g_t\|_*^2}.
\]

Moreover, if R = ½‖·‖₂², then, setting η_t := η(g_{1:t−1}) for each t ∈ [T], we have

\[
x_t = \Pi^{\|\cdot\|_2}_X\big([t > 1](x_{t-1} - \eta_t g_{t-1})\big), \qquad \forall t \in [T]. \tag{6.1}
\]

¹We use the squared ℓ₂-norm for the sake of simplicity and because it yields the online gradient descent algorithm.

²The constant multiplying the regularizer.

Proof. Let us first prove the regret bound. First of all, since R is a mirror map for X, it is clear from the definition of mirror map that µR is a mirror map for X for any µ ∈ R_{++}. Thus, 𝓡 is a mirror map strategy for C. For each t ∈ {1, …, T + 1} define r_t := 𝓡(f_{1:t−1}). Note that, for every t ∈ [T],

\[
\sum_{i=1}^{t} r_i = \frac{1}{\sqrt{\sigma}\,\eta_t} R,
\]

which is (√σ/η_t)-strongly convex w.r.t. ‖·‖. By setting x_0 := x_1, Theorem 5.4.3 yields, for every u ∈ X,

\[
\begin{aligned}
\mathrm{Regret}(\mathrm{AdaOMD}^{\mathcal{R}}_X, \boldsymbol{f}, u)
&\le \sum_{t=1}^{T+1} B_{r_t}(u, x_{t-1}) + \frac{1}{2}\sum_{t=1}^{T} \frac{\eta_{t+1}}{\sqrt{\sigma}}\|g_t\|_*^2\\
&= \frac{1}{2}\sum_{t=1}^{T+1} \Big(\frac{1}{\eta_t} - [t > 1]\frac{1}{\eta_{t-1}}\Big)\frac{1}{\sqrt{\sigma}}\, B_R(u, x_{t-1}) + \frac{1}{2\sqrt{\sigma}}\sum_{t=1}^{T} \eta_{t+1}\|g_t\|_*^2\\
&\le \frac{\theta}{2\eta_{T+1}\sqrt{\sigma}} + \frac{1}{2\sqrt{\sigma}}\sum_{t=1}^{T} \eta_{t+1}\|g_t\|_*^2\\
&\le \frac{1}{2}\sqrt{\frac{2\theta}{\sigma}\sum_{t=1}^{T}\|g_t\|_*^2} + \frac{1}{2}\sqrt{\frac{\theta}{2\sigma}}\sum_{t=1}^{T}\frac{\|g_t\|_*^2}{\sqrt{\sum_{j=1}^{t}\|g_j\|_*^2}}\\
&\overset{\text{Lem.\ 4.6.2}}{\le} \frac{1}{2}\sqrt{\frac{2\theta}{\sigma}\sum_{t=1}^{T}\|g_t\|_*^2} + \frac{1}{2}\sqrt{\frac{2\theta}{\sigma}\sum_{t=1}^{T}\|g_t\|_*^2}
= \sqrt{\frac{2\theta}{\sigma}\sum_{t=1}^{T}\|g_t\|_*^2}.
\end{aligned}
\]

Let us show that (6.1) holds for R := ½‖·‖₂². First, recall that, by Lemma 5.2.1, ½‖·‖₂² is a mirror map for X which is 1-strongly convex w.r.t. the ℓ₂-norm. Let us proceed by induction on t ∈ [T]. For t = 1, by the definition of AdaOMD^𝓡_X(⟨⟩) we have {x_1} = arg min_{x∈X} ‖x‖₂ = {Π^{‖·‖₂}_X(0)}.

Thus, let t ∈ {2, …, T}, and let R_t and y_t be defined as in the definition of AdaOMD^𝓡_X(f) in Algorithm 5.1, that is,

\[
R_t := \sum_{i=1}^{t} r_i = \frac{1}{2\eta_t}\|\cdot\|_2^2
\qquad\text{and}\qquad
y_t := \nabla R_t(x_{t-1}) - g_{t-1} = \frac{1}{\eta_t} x_{t-1} - g_{t-1}.
\]

Since the ℓ₂-norm is self-dual, by Theorem 3.8.2 we have (½‖·‖₂²)* = ½‖·‖₂². Thus, by Theorem 3.4.3 we have R_t^*(y) = (η_t/2)‖y‖₂² for every y ∈ E. Finally, the definition of x_t in AdaOMD^𝓡_X(f) yields

\[
x_t = \Pi^{R_t}_X(\nabla R_t^*(y_t)) = \Pi^{\|\cdot\|_2}_X(x_{t-1} - \eta_t g_{t-1}).
\]

Note that the above regret bound is at least as good as the one for EOMD from Corollary 5.4.4 (or as the one for the proximal FTRL example from Corollary 4.7.3). Indeed, let C := (X, F) be an OCO instance, let R be a mirror map for X which is 1-strongly convex on X w.r.t. a norm ‖·‖ on E, let T ∈ N, and let θ and g_t ∈ R^d be as in the above theorem for every t ∈ [T]. If we know ρ ∈ R such that ‖g_t‖_* ≤ ρ for every t ∈ N, then

\[
\sqrt{2\theta \sum_{t=1}^{T} \|g_t\|_*^2} \le \rho\sqrt{2\theta T},
\]

and the latter exactly matches the bound from Corollary 5.4.4. However, the bound given by Theorem 6.1.1 can be much better: if the enemy picks many functions whose subgradients have norm far smaller than ρ, the algorithm is able to exploit this fact (using bigger step sizes) and attain smaller regret.
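A tiny numerical illustration of this gap (ours, with made-up subgradient norms): when most rounds produce small subgradients, the adaptive quantity √(2θ Σ_t ‖g_t‖²) is far below the worst-case ρ√(2θT).

```python
import numpy as np

theta, rho, T = 2.0, 1.0, 10_000
# suppose only one round in a hundred produces a subgradient of norm rho,
# while the remaining rounds produce subgradients of norm 0.01
norms = np.where(np.arange(T) % 100 == 0, rho, 0.01)
adaptive = np.sqrt(2 * theta * np.sum(norms ** 2))   # about 20.1
worst_case = rho * np.sqrt(2 * theta * T)            # exactly 200.0
print(adaptive, worst_case)
```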

One may wonder whether a mirror map strategy similar to the one from the above theorem works for the Adaptive Dual Averaging algorithm. As we have discussed in Section 5.5, Dual Averaging is practically equivalent to the general FTRL algorithm applied to the subgradients of the functions picked by the enemy. Thus, not surprisingly, the regret bound for AdaDA from Corollary 5.5.2 is identical to the regret bound for AdaFTRL with general regularizer strategies given by Theorem 4.4.3. That is, when using AdaDA we do not have the additional adaptiveness present in the proximal FTRL regret bound from Theorem 4.4.4. Even though we can adjust the multiplicative factors of AdaDA with a mirror map strategy similar to the one from the last theorem so that the final regret bound is similar as well, in this case we still need knowledge of the Lipschitz constant (or of an upper bound on the subgradients' norms) of the functions played by the enemy to properly adjust the multiplicative factor of the mirror map strategy. This is one case where the intuition that proximal regularizer strategies can "measure the subgradient from round t with the norm related to the regularizer increment from round t + 1", as discussed in Section 4.4, yields a relevant difference between regret bounds.

Theorem 6.1.2. Let C := (X, F) be an OCO instance such that X is closed and each f ∈ F is a proper closed function which is subdifferentiable on X, and let R: E → (−∞, +∞] be a mirror map for X which is σ-strongly convex on X w.r.t. a norm ‖·‖ on E. Suppose each f ∈ F is ρ-Lipschitz continuous w.r.t. ‖·‖ on a convex set D ⊆ E with nonempty interior such that³ int D ⊇ X, and suppose there is θ ∈ R_{++} such that θ ≥ sup{R(u) − R(x) : u, x ∈ X}. Define

\[
\eta(g) := \sqrt{\frac{\theta}{2\big(\rho^2 + \sum_{j=1}^{t}\|g_j\|_*^2\big)}} \qquad \text{for every } g = \langle g_1, \dots, g_t\rangle \in \mathrm{Seq}(\mathbb{E}),
\]

and

\[
\mathcal{R}(f) := \Big(\frac{1}{\eta(g_{1:t})} - [t > 0]\,\frac{1}{\eta(g_{1:t-1})}\Big)\frac{1}{\sqrt{\sigma}}\, R \qquad \text{for every } f = \langle f_1, \dots, f_t\rangle \in \mathrm{Seq}(\mathcal{F}),
\]

where g_i ∈ E is as in AdaDA^𝓡_X(f) for i ∈ [t]. Let ENEMY be an enemy oracle for C, let T ∈ N, and define

\[
(\boldsymbol{x}, \boldsymbol{f}) := \mathrm{OCO}_{\mathcal{C}}(\mathrm{AdaDA}^{\mathcal{R}}_X, \mathrm{ENEMY}, T).
\]

Moreover, let g_t ∈ ∂f_t(x_t) be the same as in the definition of AdaDA^𝓡_X(f) in Algorithm 5.3 for each t ∈ [T], and set η_t := η(g_{1:t−1}) for each t ∈ [T]. Then

\[
\mathrm{Regret}(\mathrm{AdaDA}^{\mathcal{R}}_X, \boldsymbol{f}, X) \le \sqrt{\frac{2\theta}{\sigma}\Big(\rho^2 + \sum_{t=1}^{T-1}\|g_t\|_*^2\Big)}.
\]

³Here we assume this so that, for any f ∈ F, any subgradient of f at a point of X has dual norm at most ρ by Theorem 3.8.4. Still, if one can control the choice of subgradients in AdaDA, one may only require X ⊆ D.

Proof. Since µR is a mirror map for X for any µ ∈ R_{++}, we have that 𝓡 is a mirror map strategy for C. For each t ∈ {1, …, T + 1} define r_t := 𝓡(f_{1:t−1}). Note that, for every t ∈ [T],

\[
\sum_{i=1}^{t} r_i = \frac{1}{\eta_t\sqrt{\sigma}} R,
\]

which is (√σ/η_t)-strongly convex w.r.t. ‖·‖. By setting x_0 := x_1, Corollary 5.5.2 yields, for every u ∈ X,

\[
\begin{aligned}
\mathrm{Regret}(\mathrm{AdaDA}^{\mathcal{R}}_X, \boldsymbol{f}, u)
&\le \sum_{t=1}^{T} \big(r_t(u) - r_t(x_t)\big) + \frac{1}{2}\sum_{t=1}^{T} \frac{\eta_t}{\sqrt{\sigma}}\|g_t\|_*^2\\
&= \frac{1}{2}\sum_{t=1}^{T} \Big(\frac{1}{\eta_t} - [t > 1]\frac{1}{\eta_{t-1}}\Big)\frac{1}{\sqrt{\sigma}}\big(R(u) - R(x_t)\big) + \frac{1}{2\sqrt{\sigma}}\sum_{t=1}^{T}\eta_t\|g_t\|_*^2\\
&\le \frac{\theta}{2\eta_T\sqrt{\sigma}} + \frac{1}{2\sqrt{\sigma}}\sum_{t=1}^{T}\eta_t\|g_t\|_*^2\\
&\le \frac{1}{2}\sqrt{\frac{2\theta}{\sigma}\Big(\rho^2 + \sum_{t=1}^{T-1}\|g_t\|_*^2\Big)} + \frac{1}{2}\sqrt{\frac{\theta}{2\sigma}}\sum_{t=1}^{T}\frac{\|g_t\|_*^2}{\sqrt{\rho^2 + \sum_{j=1}^{t-1}\|g_j\|_*^2}}\\
&\le \frac{1}{2}\sqrt{\frac{2\theta}{\sigma}\Big(\rho^2 + \sum_{t=1}^{T-1}\|g_t\|_*^2\Big)} + \frac{1}{2}\sqrt{\frac{\theta}{2\sigma}}\sum_{t=1}^{T}\frac{\|g_t\|_*^2}{\sqrt{\sum_{j=1}^{t}\|g_j\|_*^2}}\\
&\overset{\text{Lem.\ 4.6.2}}{\le} \frac{1}{2}\sqrt{\frac{2\theta}{\sigma}\Big(\rho^2 + \sum_{t=1}^{T-1}\|g_t\|_*^2\Big)} + \frac{1}{2}\sqrt{\frac{2\theta}{\sigma}\sum_{t=1}^{T}\|g_t\|_*^2}\\
&\le \sqrt{\frac{2\theta}{\sigma}\Big(\rho^2 + \sum_{t=1}^{T-1}\|g_t\|_*^2\Big)}.
\end{aligned}
\]
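For comparison with the eager rule sketched earlier, here is a minimal sketch (ours; the unit-ball feasible set and the use of ρ² are illustrative choices matching the theorem above) of the lazy, dual-averaging counterpart with R = ½‖·‖₂²: the played point is a projection of the scaled negative subgradient sum, and the step size needs the Lipschitz-type constant ρ in advance, in addition to the running sum.

```python
import numpy as np

def project_unit_ball(z):
    n = np.linalg.norm(z)
    return z if n <= 1.0 else z / n

def adaptive_dual_averaging(subgradient, T, d, theta, rho):
    """Lazy (dual-averaging) analogue of the adaptive rule: the round-t point is
    x_t = Pi_X(-eta_t * sum_{j<t} g_j), with
    eta_t = sqrt(theta / (2 * (rho**2 + sum_{j<t} ||g_j||_2**2))).
    Note that rho must be supplied up front, unlike in the eager version."""
    g_sum = np.zeros(d)
    sq_sum = 0.0
    iterates = []
    for t in range(1, T + 1):
        eta = np.sqrt(theta / (2.0 * (rho ** 2 + sq_sum)))
        x = project_unit_ball(-eta * g_sum)
        iterates.append(x)
        g = subgradient(t, x)
        g_sum += g
        sq_sum += float(g @ g)
    return iterates
```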