
which is $(\sqrt{\sigma}/\eta_t)$-strongly convex w.r.t. $\|\cdot\|$. By setting $x_0 := x_1$, Corollary 5.5.2 yields, for every $u \in X$,
\begin{align*}
\operatorname{Regret}(\mathrm{AdaDA}^X_R, f, u)
&\le \sum_{t=1}^{T}\bigl(r_t(u) - r_t(x_t)\bigr) + \frac{1}{2}\sum_{t=1}^{T}\frac{\eta_t}{\sqrt{\sigma}}\|g_t\|^2\\
&= \frac{1}{2}\sum_{t=1}^{T}\Bigl(\frac{1}{\eta_t} - [t>1]\frac{1}{\eta_{t-1}}\Bigr)\frac{1}{\sqrt{\sigma}}\bigl(R(u) - R(x_t)\bigr) + \frac{1}{2\sqrt{\sigma}}\sum_{t=1}^{T}\eta_t\|g_t\|^2\\
&\le \frac{\theta}{2\eta_T\sqrt{\sigma}} + \frac{1}{2\sqrt{\sigma}}\sum_{t=1}^{T}\eta_t\|g_t\|^2\\
&\le \frac{1}{2}\sqrt{\frac{2\theta}{\sigma}\Bigl(\rho + \sum_{t=1}^{T-1}\|g_t\|^2\Bigr)} + \frac{1}{2}\sqrt{\frac{\theta}{2\sigma}}\sum_{t=1}^{T}\frac{\|g_t\|^2}{\sqrt{\rho + \sum_{j=1}^{t-1}\|g_j\|^2}}\\
&\le \frac{1}{2}\sqrt{\frac{2\theta}{\sigma}\Bigl(\rho + \sum_{t=1}^{T-1}\|g_t\|^2\Bigr)} + \frac{1}{2}\sqrt{\frac{\theta}{2\sigma}}\sum_{t=1}^{T}\frac{\|g_t\|^2}{\sqrt{\sum_{j=1}^{t}\|g_j\|^2}}\\
&\overset{\text{Lem.~4.6.2}}{\le} \frac{1}{2}\sqrt{\frac{2\theta}{\sigma}\Bigl(\rho + \sum_{t=1}^{T-1}\|g_t\|^2\Bigr)} + \frac{1}{2}\sqrt{\frac{2\theta}{\sigma}\sum_{t=1}^{T}\|g_t\|^2}\\
&\le \sqrt{\frac{2\theta}{\sigma}\Bigl(\rho + \sum_{t=1}^{T-1}\|g_t\|^2\Bigr)}.
\end{align*}

rank-one matrices based mainly on the subgradients of the previous functions picked by the enemy. In this way, the update rule of the iterate of AdaGrad at round $t \in \mathbb{N}\setminus\{0\}$ is of the form
\begin{equation}
x_t = \Pi^{H_t^{-1}}_X\bigl([t>1](x_{t-1} - H_t g_{t-1})\bigr), \tag{6.2}
\end{equation}
where $X \subseteq \mathbb{R}^d$ is the set from which the player is allowed to pick his points, $x_{t-1}$ is the player's iterate at round $t-1$, $g_{t-1}$ is a subgradient of the enemy's function at round $t-1$, and $\Pi^{H_t^{-1}}_X$ is the projection onto $X$ w.r.t. the norm $\|\cdot\|_{H_t^{-1}}$. Thus, the matrix $H_t$ intuitively skews the subgradient of the previous round in a desirable way, and adjusts the projection to balance the skewed subgradient step.
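For concreteness, a minimal numerical sketch of one update of the form (6.2) is given below. It assumes a diagonal $H_t$ built from the sum of squared past subgradient coordinates (the usual diagonal variant of AdaGrad) and a box-shaped $X$, for which the projection w.r.t. $\|\cdot\|_{H_t^{-1}}$ decomposes coordinate-wise into clipping; both choices are illustrative assumptions and not the exact matrices $H_t$ derived in the text.

```python
import numpy as np

def adagrad_diag_step(x_prev, g_prev, s, lo, hi, eps=1e-8):
    """One update of the form (6.2), under illustrative assumptions.

    Assumptions (not from the text): H_t is diagonal with entries
    1/sqrt(eps + sum of squared past subgradient coordinates), and X is
    the box [lo, hi]^d, so the projection w.r.t. the H_t^{-1}-norm is
    coordinate-wise clipping (the weighted norm is separable).
    """
    s = s + g_prev**2              # running sum of squared coordinates
    h = 1.0 / np.sqrt(eps + s)     # diagonal of H_t
    y = x_prev - h * g_prev        # skewed subgradient step
    x = np.clip(y, lo, hi)         # projection onto the box
    return x, s

# Toy usage with d = 3 and X = [-1, 1]^3.
x, s = np.zeros(3), np.zeros(3)
for g in (np.array([1.0, -2.0, 0.5]), np.array([0.3, 0.1, -1.0])):
    x, s = adagrad_diag_step(x, g, s, lo=-1.0, hi=1.0)
```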

Another algorithm for Online Convex Optimization with a similar update rule is the Online Newton Step (ONS) algorithm [37], which is guaranteed to attain regret with a logarithmic dependence on the number of rounds if the functions played by the enemy are differentiable and exp-concave, a generalization of strong convexity which will be formally described and discussed later. The algorithm's update rule is of the same form as (6.2), with only the choice of the matrix $H_t$ being different, even though it is still a function of the subgradients of the previous choices of the enemy.

In spite of their similarities, AdaGrad and ONS were discovered independently and had unrelated analyses. The authors of [33] proposed the AdaReg algorithm and showed that both AdaGrad and ONS are special cases of AdaReg. This sheds some light on the intuition behind these algorithms. Additionally, it leaves room for the creation of other similar and interesting OCO algorithms. We describe a player oracle which implements the AdaReg algorithm in Algorithm 6.1.

The AdaReg algorithm is parameterized by a function $\Phi\colon \mathbb{S}^d \to (-\infty,+\infty]$, called a meta-regularizer, which dictates which matrices to use in the update (6.2).

Definition 6.2.1 (Meta-regularizer). A function $\Phi\colon \mathbb{S}^d \to (-\infty,+\infty]$ is a meta-regularizer if, for any $G \in \mathbb{S}^d_{++}$,

(6.3.i) the infimum $\inf_{H \in \mathbb{S}^d_{++}}\bigl(\langle G, H\rangle + \Phi(H)\bigr)$ is attained,

(6.3.ii) for any $g \in E$, if
\[
H_T \in \operatorname*{arg\,min}_{H \in \mathbb{S}^d_{++}}\bigl(\langle G, H\rangle + \Phi(H)\bigr)
\quad\text{and}\quad
H_{T+1} \in \operatorname*{arg\,min}_{H \in \mathbb{S}^d_{++}}\bigl(\langle G + gg^{\mathsf T}, H\rangle + \Phi(H)\bigr),
\]
then $H_T \succeq H_{T+1}$ (which implies $H_{T+1}^{-1} \succeq H_T^{-1}$ since $H_T$ and $H_{T+1}$ are positive definite).
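As a hedged illustration of the definition (this specific choice is an assumption made here for concreteness, not necessarily one of the meta-regularizers studied in [33]), consider $\Phi(H) := \operatorname{Tr}(H^{-1})$ on $\mathbb{S}^d_{++}$. The objective in (6.3.i) is then convex on $\mathbb{S}^d_{++}$ and stationarity gives the minimizer in closed form:
\[
\nabla_H\bigl(\langle G, H\rangle + \operatorname{Tr}(H^{-1})\bigr) = G - H^{-2} = 0
\quad\Longleftrightarrow\quad H = G^{-1/2},
\]
so the infimum in (6.3.i) is attained. Moreover, if $G' := G + gg^{\mathsf T}$, then $G \preceq G'$, hence $G^{1/2} \preceq G'^{1/2}$ by operator monotonicity of the square root, and therefore $(G')^{-1/2} \preceq G^{-1/2}$, which is exactly condition (6.3.ii).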

Let us look a little closer at the definition of AdaReg in Algorithm 6.1 for a game with $T \in \mathbb{N}$ rounds and some $\varepsilon > 0$ and, during this discussion, let us look at the reasons for the conditions imposed on meta-regularizers. Let $\mathcal{C} := (X, \mathcal{F})$ be an OCO instance, let $T \in \mathbb{N}$, and let $f \in \mathcal{F}^T$. Moreover, let $t \in \{0, \dots, T-1\}$. At round $t+1$, i.e. when the algorithm is computing $x_{t+1}$, the algorithm builds a positive definite matrix $G_t$, which is the sum of rank-one matrices (based on the subgradients of the enemy's functions) plus$^4$ $\varepsilon I$. Then AdaReg performs its key step: the choice of the matrix $H_{t+1}$ which it uses to perform the "skewed" gradient step as in (6.2). Namely, AdaReg with meta-regularizer $\Phi$ picks $H_{t+1}$ that attains
\begin{equation}
\inf_{H \in \mathbb{S}^d_{++}}\bigl(\langle G_t, H\rangle + \Phi(H)\bigr), \tag{6.4}
\end{equation}

$^4$The main goal of this latter term is to ensure that $G_t$ is invertible, but the value of $\varepsilon > 0$ may affect the guarantees of the algorithms we shall see later on.

Algorithm 6.1 Definition of $\mathrm{AdaReg}^X_\Phi\langle f_1, \dots, f_T\rangle$

Input:
(i) A closed convex set $X \subseteq \mathbb{R}^d$,
(ii) Convex functions $f_1, \dots, f_T \in \mathcal{F}$ for some $T \in \mathbb{N}$ and $\mathcal{F} \subseteq (-\infty,+\infty]^{\mathbb{R}^d}$ such that $f_t$ is subdifferentiable on $X$ for each $t \in [T]$,
(iii) A meta-regularizer $\Phi\colon \mathbb{S}^d \to (-\infty,+\infty]$,
(iv) A real number $\varepsilon > 0$ (usually clear from the context)

Output: $x_{T+1} \in X$

$G_0 \leftarrow \varepsilon I$
Let $H_1 \in \operatorname*{arg\,min}_{H \in \mathbb{S}^d_{++}}\bigl(\langle G_0, H\rangle + \Phi(H)\bigr)$
Let $\{x_1\} \leftarrow \operatorname*{arg\,min}_{x \in X}\|x\|_{H_1^{-1}} = \operatorname*{arg\,min}_{x \in X} x^{\mathsf T} H_1^{-1} x$
for $t = 1$ to $T$ do
    ▷ Computations for round $t+1$
    Compute $g_t \in \partial f_t(x_t)$
    $G_t \leftarrow G_{t-1} + g_t g_t^{\mathsf T}$
    Compute $H_{t+1} \in \operatorname*{arg\,min}_{H \in \mathbb{S}^d_{++}}\bigl(\langle G_t, H\rangle + \Phi(H)\bigr)$
    $x_{t+1} \leftarrow \Pi^{H_{t+1}^{-1}}_X(x_t - H_{t+1} g_t)$
return $x_{T+1}$

where the above infimum is attained by property (6.3.i). Although the above expression can seem cryptic at first, it has a very elegant interpretation. By the definition of the AdaReg oracle, we have $G_t = \varepsilon I + \sum_{i=1}^{t} g_i g_i^{\mathsf T}$, where for each $t \in [T]$ the vector $g_t \in \mathbb{R}^d$ is a subgradient as defined in $\mathrm{AdaReg}^X_\Phi(f_{1:t})$. Thus, for every $H \in \mathbb{S}^d_{++}$ we have
\begin{equation}
\begin{aligned}
\langle G_t, H\rangle + \Phi(H)
&= \sum_{i=1}^{t}\langle g_i g_i^{\mathsf T}, H\rangle + \varepsilon\operatorname{Tr}(H) + \Phi(H)
= \sum_{i=1}^{t}\operatorname{Tr}(g_i g_i^{\mathsf T} H) + \varepsilon\operatorname{Tr}(H) + \Phi(H)\\
&= \sum_{i=1}^{t} g_i^{\mathsf T} H g_i + \varepsilon\operatorname{Tr}(H) + \Phi(H)
= \sum_{i=1}^{t}\|g_i\|_H^2 + \varepsilon\operatorname{Tr}(H) + \Phi(H).
\end{aligned}
\tag{6.5}
\end{equation}
That is, the matrix $H_{t+1}$ is chosen so that the sizes of the subgradients measured by its induced norm are minimized while still not making $\Phi(H) + \varepsilon\operatorname{Tr}(H)$ too high$^5$. Recall that the sum of the squared norms of the subgradients is directly connected to almost all the regret bounds seen in Chapters 4 and 5. Thus, $H_{t+1}$ can be seen roughly as the best matrix with low complexity w.r.t. the meta-regularizer $\Phi$ through which to measure/see the subgradients of the functions played by the enemy so far. Another way to see the choice of $H_{t+1}$, which is the main idea the authors of [33] use in their analysis of AdaReg, is to note that $H_{t+1}$ is the point picked by $\mathrm{FTRL}_{\Phi'}(\langle\psi_1, \dots, \psi_t\rangle)$, where $\psi_i(H) := \langle g_i g_i^{\mathsf T}, H\rangle$ for each $i \in [t]$ and $\Phi' := \Phi + \varepsilon\operatorname{Tr}(\cdot) + \delta(\cdot \mid \mathbb{S}^d_{++})$. That is, the problem of choosing a matrix norm through which to measure the subgradients played by the enemy is seen as a separate OCO instance! In the regret bounds which we prove later in this section it will be clear how well this strategy minimizes the norms of the subgradients.

$^5$The value of $\Phi(H) + \varepsilon\operatorname{Tr}(H)$ here can be interpreted as the "complexity" of the norm $\|\cdot\|_H$.
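To make Algorithm 6.1 and the choice (6.4) concrete, the following minimal numpy sketch instantiates AdaReg with the hypothetical meta-regularizer $\Phi(H) = \operatorname{Tr}(H^{-1})$ from the illustration after Definition 6.2.1 (so that the minimizer of (6.4) is $H_{t+1} = G_t^{-1/2}$) and with $X = \mathbb{R}^d$, so the projection step is the identity. Both choices are assumptions made only for this sketch; they are not the instantiations analyzed later in the text.

```python
import numpy as np

def inv_sqrt(G):
    """Symmetric inverse square root of a positive definite matrix."""
    w, V = np.linalg.eigh(G)
    return (V / np.sqrt(w)) @ V.T

def adareg(subgradient, T, d, eps=1.0):
    """Sketch of Algorithm 6.1 under illustrative assumptions.

    Assumptions not fixed by the text: Phi(H) = Tr(H^{-1}), for which
    arg min_H <G, H> + Phi(H) is H = G^{-1/2}, and X = R^d, so the
    projection onto X is the identity and x_1 = 0.
    `subgradient(t, x)` plays the role of "compute g_t in the
    subdifferential of f_t at x_t".
    """
    G = eps * np.eye(d)          # G_0 = eps * I
    x = np.zeros(d)              # x_1 = arg min_x ||x||_{H_1^{-1}} = 0 on R^d
    for t in range(1, T + 1):
        g = subgradient(t, x)    # g_t
        G = G + np.outer(g, g)   # G_t = G_{t-1} + g_t g_t^T
        H = inv_sqrt(G)          # H_{t+1} minimizes <G_t, H> + Tr(H^{-1})
        x = x - H @ g            # skewed step; projection omitted (X = R^d)
    return x

# Toy usage: f_t(x) = |a_t^T x - b_t|, whose subgradient is easy to write down.
rng = np.random.default_rng(0)
A, b = rng.normal(size=(50, 3)), rng.normal(size=50)
sg = lambda t, x: np.sign(A[t - 1] @ x - b[t - 1]) * A[t - 1]
x_final = adareg(sg, T=50, d=3)
```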

The reader may still be confused about condition (6.3.ii) since, during the above discussion, this condition was never mentioned. Not only that, the AdaReg oracle from Algorithm 6.1 does not seem to need this condition for all of its operations to be well-defined. Indeed, condition (6.3.ii) from the definition of meta-regularizers is not needed for the definition of AdaReg to make sense. However, as we shall soon see, this condition is fundamental for the regret bounds that we derive to hold. Interestingly, even though condition (6.3.ii) is not explicitly stated in [33], all the meta-regularizers the authors use satisfy this condition (which is used explicitly in their proofs).

As one may have noticed, the update in (6.2) closely resembles the update of the Adaptive Online Mirror Descent. Indeed, to bound the regret of AdaReg we will write it as an Adaptive Online Mirror Descent algorithm with a carefully$^6$ crafted mirror map strategy, which we formally define in Algorithm 6.2.

$^6$One may note that we need to ensure that the subgradients used by the mirror map strategy match the ones used by the AdaOMD oracle, and we do so by synchronizing the well-orders used on the subdifferentials by the AdaOMD oracle and by the mirror map strategy. See the discussion following Definition 4.2.1 to recall why we equip well-orders to the subdifferentials used.

Algorithm 6.2 Definition of $\mathcal{M}(X, \Phi, \varepsilon)\langle f_1, \dots, f_T\rangle$

Input:
(i) A closed convex set $X \subseteq \mathbb{R}^d$,
(ii) Convex functions $f_1, \dots, f_T \in \mathcal{F}$ for some $T \in \mathbb{N}$ and $\mathcal{F} \subseteq (-\infty,+\infty]^{\mathbb{R}^d}$ such that $f_t$ is subdifferentiable on $X$ for each $t \in [T]$,
(iii) A meta-regularizer $\Phi\colon \mathbb{S}^d \to (-\infty,+\infty]$,
(iv) A real number $\varepsilon > 0$

Output: A function $r_{T+1}\colon \mathbb{R}^d \to (-\infty,+\infty]$

for $t = 1$ to $T$ do
    ▷ Capture the subgradients used on rounds $1, \dots, T$
    $x_t \leftarrow \mathrm{AdaOMD}^X_{\mathcal{M}(X,\Phi,\varepsilon)}(\langle f_1, \dots, f_{t-1}\rangle)$
    ▷ Equip the right well-order to match the subgradient choice of $\mathrm{AdaOMD}^X_{\mathcal{M}(X,\Phi,\varepsilon)}$
    Equip $\partial f_t(x_t)$ with the same well-order used by $\mathrm{AdaOMD}^X_{\mathcal{M}(X,\Phi,\varepsilon)}$
    Pick $g_t \in \partial f_t(x_t)$
▷ Compute the mirror map increment for round $T+1$
$G_{T-1} \leftarrow \varepsilon I + \sum_{t=1}^{T-1} g_t g_t^{\mathsf T}$
$G_T \leftarrow G_{T-1} + g_T g_T^{\mathsf T}$
Let $H_T \in \operatorname*{arg\,min}_{H \in \mathbb{S}^d_{++}}\bigl(\langle G_{T-1}, H\rangle + \Phi(H)\bigr)$
Let $H_{T+1} \in \operatorname*{arg\,min}_{H \in \mathbb{S}^d_{++}}\bigl(\langle G_T, H\rangle + \Phi(H)\bigr)$
$D_{T+1} \leftarrow H_{T+1}^{-1} - [T > 0]\,H_T^{-1}$
return $x \in \mathbb{R}^d \mapsto \tfrac{1}{2}x^{\mathsf T} D_{T+1} x = \tfrac{1}{2}\|x\|^2_{D_{T+1}}$

Note that if $\Phi$ in the definition of $\mathcal{M}$ in Algorithm 6.2 is a meta-regularizer, then the matrices $D_t$ in the definition of $\mathcal{M}$ are positive semidefinite by condition (6.3.ii). That is, the functions delivered by $\mathcal{M}$ are always convex in this case. This is important if we want to write AdaReg in the form of an AdaOMD algorithm, since in order for $\mathcal{M}$ to be a mirror map strategy (and for us to apply the regret bounds we have proved in Chapter 5), we need the functions it delivers at each round, i.e. the mirror map increments, to be convex. In the following lemma we prove that if we plug into $\mathcal{M}$ a meta-regularizer, then $\mathcal{M}$ is a mirror map strategy and the mirror maps it builds at each round are scaled squared matrix norms.
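A quick numerical sanity check of this positive semidefiniteness, under the same hypothetical meta-regularizer $\Phi(H) = \operatorname{Tr}(H^{-1})$ used in the earlier sketches (so that $H_t^{-1} = G_{t-1}^{1/2}$), is shown below; the PSD property of the increments is exactly what condition (6.3.ii) buys us.

```python
import numpy as np

def sqrtm_sym(G):
    """Symmetric square root via eigendecomposition."""
    w, V = np.linalg.eigh(G)
    return (V * np.sqrt(w)) @ V.T

# Illustrative check: with Phi(H) = Tr(H^{-1}) we have H_t^{-1} = G_{t-1}^{1/2},
# and the increments D_t = H_t^{-1} - H_{t-1}^{-1} should be positive semidefinite.
rng = np.random.default_rng(1)
d, T, eps = 4, 20, 1.0
G = eps * np.eye(d)
H_inv_prev = sqrtm_sym(G)              # H_1^{-1} = G_0^{1/2}
for t in range(1, T + 1):
    g = rng.normal(size=d)
    G = G + np.outer(g, g)             # G_t
    H_inv = sqrtm_sym(G)               # H_{t+1}^{-1} = G_t^{1/2}
    D = H_inv - H_inv_prev             # increment D_{t+1}
    assert np.linalg.eigvalsh(D).min() >= -1e-9, "increment not PSD"
    H_inv_prev = H_inv
```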

Lemma 6.2.2. Let $\mathcal{C} := (X, \mathcal{F})$ be an OCO instance such that $X$ is a closed set and such that each $f \in \mathcal{F}$ is a proper closed function which is subdifferentiable on $X$. Moreover, let $\varepsilon > 0$ and let $\Phi\colon \mathbb{S}^d \to (-\infty,+\infty]$ be a meta-regularizer. Let $T \in \mathbb{N}$, let $f \in \mathcal{F}^T$, and let $H_t \in \mathbb{S}^d_{++}$ and $D_t \in \mathbb{S}^d$ be as defined in $\mathcal{M}(X,\Phi,\varepsilon)(f_{1:t-1})$ for each $t \in \{1, \dots, T+1\}$. Finally, for every $t \in \{1, \dots, T+1\}$ define
\[
r_t := \mathcal{M}(X,\Phi,\varepsilon)(f_{1:t-1}) \quad\text{and}\quad R_t := \sum_{i=1}^{t} r_i.
\]
Then $\mathcal{M}(X,\Phi,\varepsilon)$ is a mirror map strategy for $\mathcal{C}$ which is differentiable on $\mathbb{R}^d$. Moreover, for every $t \in \{1, \dots, T+1\}$ we have $D_t \succeq 0$, $r_t = \tfrac{1}{2}\|\cdot\|^2_{D_t}$, and $R_t = \tfrac{1}{2}\|\cdot\|^2_{H_t^{-1}}$. Finally, $R_t$ is $1$-strongly convex w.r.t. $\|\cdot\|_{H_t^{-1}}$ on $\mathbb{R}^d$ for every $t \in \{1, \dots, T+1\}$.

Proof. Let $t \in \{1, \dots, T+1\}$. First, note that since $\Phi$ is a meta-regularizer, by condition (6.3.ii) we have that $D_t \succeq 0$. Let us now show that
\begin{equation}
r_t = \tfrac{1}{2}\|\cdot\|^2_{D_t} \quad\text{and}\quad R_t = \tfrac{1}{2}\|\cdot\|^2_{H_t^{-1}}. \tag{6.6}
\end{equation}
Note that the form of $r_t$ as in (6.6) holds by the definition of $[\mathcal{M}(X,\Phi,\varepsilon)](f_{1:t-1})$. Moreover, for every $x \in \mathbb{R}^d$ we have
\[
R_t(x) = \sum_{i=1}^{t} r_i(x) = \sum_{i=1}^{t}\tfrac{1}{2}x^{\mathsf T} D_i x
= \tfrac{1}{2}x^{\mathsf T}\Bigl(\sum_{i=1}^{t}\bigl(H_i^{-1} - [i>1]H_{i-1}^{-1}\bigr)\Bigr)x
= \tfrac{1}{2}x^{\mathsf T} H_t^{-1} x.
\]
This proves (6.6). Let us now show that

(6.7) $\mathcal{M}(X,\Phi,\varepsilon)$ is a mirror map strategy for $\mathcal{C}$ which is differentiable on $\mathbb{R}^d$ and such that $R_t$ is $1$-strongly convex w.r.t. $\|\cdot\|_{H_t^{-1}}$ on $E$.

First, note that $r_t$ is twice continuously differentiable (and, thus, closed) with $\nabla^2 r_t(x) = D_t$ for any $x \in \mathbb{R}^d$. Since $D_t \succeq 0$ by the conditions of a meta-regularizer, by Lemma 3.1.1 we conclude that $r_t$ is convex. It only remains to show that $R_t$ is a mirror map for $X$. That is, we need to prove that

(i) $R_t$ is closed, proper, $1$-strongly convex$^7$ on $\mathbb{R}^d$ w.r.t. $\|\cdot\|_{H_t^{-1}}$, and differentiable on $\mathbb{R}^d$,
(ii) $\mathbb{R}^d = \operatorname{int}(\operatorname{dom} R_t)$,
(iii) for any $y \in \mathbb{R}^d$, the infima $\inf_{x \in X} B_{R_t}(x, y)$ and $\inf_{x \in X} R_t(x)$ are attained, and
(iv) $\{\nabla R_t(x) : x \in \mathbb{R}^d\} = \mathbb{R}^d$.

First, note that (ii) clearly holds, and since $\nabla R_t(x) = H_t^{-1}x$ for any $x \in \mathbb{R}^d$, we conclude that (iv) holds since $H_t^{-1}$ is invertible. Moreover, $R_t$ is twice continuously differentiable on $\mathbb{R}^d$, which implies that $R_t$ is proper and closed (in fact, continuous), and since $\nabla^2 R_t(x) = H_t^{-1} \succ 0$ for any $x \in \mathbb{R}^d$, by Lemma 3.1.1 we conclude that $R_t$ is convex. Note that if $R_t$ is strongly convex, then $B_{R_t}(\cdot, y)$ also is for any $y \in \mathbb{R}^d$, and then the infima from (iii) would be attained by Lemma 3.9.14. Thus, it only remains to show that $R_t$ is $1$-strongly convex w.r.t. $\|\cdot\|_{H_t^{-1}}$. To see that$^8$, note that for every $x, y \in \mathbb{R}^d$ we have
\begin{align*}
\tfrac{1}{2}\|x - y\|^2_{H_t^{-1}}
&= \tfrac{1}{2}(x - y)^{\mathsf T} H_t^{-1}(x - y)
= \tfrac{1}{2}x^{\mathsf T} H_t^{-1}x + \tfrac{1}{2}y^{\mathsf T} H_t^{-1}y - x^{\mathsf T} H_t^{-1}y\\
&= \tfrac{1}{2}x^{\mathsf T} H_t^{-1}x - \tfrac{1}{2}y^{\mathsf T} H_t^{-1}y - (H_t^{-1}y)^{\mathsf T}(x - y)\\
&= \tfrac{1}{2}\|x\|^2_{H_t^{-1}} - \tfrac{1}{2}\|y\|^2_{H_t^{-1}} - (H_t^{-1}y)^{\mathsf T}(x - y)\\
&= R_t(x) - R_t(y) - \nabla R_t(y)^{\mathsf T}(x - y).
\end{align*}
By Theorem 3.9.7 we conclude that $R_t$ is $1$-strongly convex w.r.t. $\|\cdot\|_{H_t^{-1}}$, which concludes the proof of (6.7).

$^7$The definition of mirror map requires strict convexity, but recall that strong convexity implies strict convexity by definition.

$^8$An easier way to prove strong convexity of $R_t$ is to note that $\|\cdot\|_{H_t^{-1}}$ is the norm induced by the inner product $(x, y) \in \mathbb{R}^d \times \mathbb{R}^d \mapsto x^{\mathsf T} H_t^{-1}y$ and then use Lemma 3.9.5. However, using direct computations seems less cumbersome in this case.

With the above lemma, we have the guarantee that $\mathcal{M}$ applied to a meta-regularizer and other properly chosen parameters is indeed a mirror map strategy. In the next theorem we prove the main result of this section: if $\mathcal{C} := (X, \mathcal{F})$ is an OCO instance, $\varepsilon > 0$, and $\Phi\colon \mathbb{S}^d \to (-\infty,+\infty]$ is a meta-regularizer, then $\mathrm{AdaReg}^X_\Phi = \mathrm{AdaOMD}^X_{\mathcal{M}(X,\Phi,\varepsilon)}$. This theorem will allow us to derive regret bounds for AdaReg and the Online Newton Step algorithm from the regret bounds we have for AdaOMD.

Theorem 6.2.3. Let $\mathcal{C} := (X, \mathcal{F})$ be an OCO instance such that $X$ is a nonempty closed set and such that each $f \in \mathcal{F}$ is a proper closed function which is subdifferentiable on $X$. Moreover, let $\varepsilon > 0$ and $\Phi\colon \mathbb{S}^d \to (-\infty,+\infty]$ be a meta-regularizer. Finally, suppose that the well-orders$^9$ over the sets used in the definition of $\mathrm{AdaReg}^X_\Phi$ are the same as in the definition of $\mathrm{AdaOMD}^X_{\mathcal{M}(X,\Phi,\varepsilon)}$. Then, for any $T \in \mathbb{N}$ and $f \in \mathcal{F}^T$,
\[
\mathrm{AdaReg}^X_\Phi(f) = \mathrm{AdaOMD}^X_{\mathcal{M}(X,\Phi,\varepsilon)}(f),
\]
and the matrix $H_{T+1} \in \mathbb{S}^d_{++}$ as defined in $\mathrm{AdaReg}^X_\Phi(f)$ is equal to the matrix $H_{T+1} \in \mathbb{S}^d_{++}$ as defined in $\mathcal{M}(X,\Phi,\varepsilon)(f)$.

$^9$This is only a technical assumption to assure that, if we have in both algorithms a statement such as "let $g_t \in \partial f_t(x_t)$", then in both algorithms the element picked from the set $\partial f_t(x_t)$ is the same.

Proof. Let $T \in \mathbb{N}$ and $f \in \mathcal{F}^T$. Moreover, for each $t \in \{1, \dots, T+1\}$ define
\[
R_t := \sum_{i=1}^{t}[\mathcal{M}(X,\Phi,\varepsilon)](f_{1:i-1}) \quad\text{and}\quad x_t := \mathrm{AdaOMD}^X_{\mathcal{M}(X,\Phi,\varepsilon)}(f_{1:t-1}).
\]
Finally, for each $t \in \{1, \dots, T+1\}$ let $H_t \in \mathbb{S}^d_{++}$ be defined as in $[\mathcal{M}(X,\Phi,\varepsilon)](f_{1:t-1})$. Note that for every $t \in \{1, \dots, T+1\}$ the matrix $H_t$ is the same as the one in $\mathrm{AdaReg}^X_\Phi(f)$, by definition and since all the sets used in the definitions of the AdaReg and AdaOMD oracles are the same. Let us prove by induction on $t \in \{1, \dots, T+1\}$ that
\[
x_t = \mathrm{AdaReg}^X_\Phi(f_{1:t-1}), \qquad \forall t \in \{1, \dots, T+1\}.
\]
For $t = 1$, we have $R_1 = \tfrac{1}{2}\|\cdot\|^2_{H_1^{-1}}$. Thus, the definition of $\mathrm{AdaOMD}^X_{\mathcal{M}(X,\Phi,\varepsilon)}$ yields
\[
\{x_1\} = \operatorname*{arg\,min}_{x \in X} R_1(x) = \operatorname*{arg\,min}_{x \in X}\|x\|_{H_1^{-1}} = \{\mathrm{AdaReg}^X_\Phi(\langle\rangle)\}.
\]
Let $t \in \{2, \dots, T+1\}$. By Lemma 6.2.2, we have that $R_t = \tfrac{1}{2}\|\cdot\|^2_{H_t^{-1}}$. With that, we have
\[
y_t := \nabla R_t(x_{t-1}) - g_{t-1} = H_t^{-1}x_{t-1} - g_{t-1}.
\]
By Lemma 3.8.5, the dual norm of $\|\cdot\|_{H_t^{-1}}$ is $\|\cdot\|_{H_t}$, and by Theorem 3.8.2 we have $R_t^* = \tfrac{1}{2}\|\cdot\|^2_{H_t}$. This together with the definition of $\mathrm{AdaOMD}(f_{1:t-1})$ yields
\[
x_t = \Pi^{H_t^{-1}}_X\bigl(\nabla R_t^*(y_t)\bigr)
= \Pi^{H_t^{-1}}_X\bigl(H_t(H_t^{-1}x_{t-1} - g_{t-1})\bigr)
= \Pi^{H_t^{-1}}_X(x_{t-1} - H_t g_{t-1})
= \mathrm{AdaReg}^X_\Phi(f_{1:t-1}).
\]

Finally, let us now show a regret bound for the AdaReg algorithm. The proof of the next regret bound has two key steps. The first is almost obvious given the previous theorem: use the regret bound for AdaOMD that we proved previously (see Theorem 5.4.3). This together with the previous theorem yields a regret bound which is arguably not very useful. The second key step in the next proof is to use the FTL–BTL Lemma on the matrices $H_t$ used by AdaReg to show the optimality, in some sense, of this choice of matrices with respect to the minimization of the norms of the subgradients. This yields a neat regret bound, whose intuition we discuss after proving the next theorem.

Theorem 6.2.4. Let $\mathcal{C} := (X, \mathcal{F})$ be an OCO instance such that $X \subseteq \mathbb{R}^d$ is a nonempty closed set and such that each $f \in \mathcal{F}$ is a proper closed function which is subdifferentiable on $X$. Let $\varepsilon > 0$ and let $\Phi\colon \mathbb{S}^d \to (-\infty,+\infty]$ be a meta-regularizer. Let $T \in \mathbb{N}$, let $\mathrm{ENEMY}$ be an enemy oracle for $\mathcal{C}$, and define
\[
(x, f) := \mathrm{OCO}_{\mathcal{C}}(\mathrm{AdaReg}^X_\Phi, \mathrm{ENEMY}, T).
\]
For each $t \in \{1, \dots, T+1\}$, let $H_t \in \mathbb{S}^d_{++}$ be as in the definition of $\mathrm{AdaReg}^X_\Phi(f_{1:t-1})$ and define $D_t := H_t^{-1} - [t>1]H_{t-1}^{-1}$. Finally, let $G_T \in \mathbb{S}^d_{++}$ be as in the definition of $\mathrm{AdaReg}^X_\Phi(f)$. Then, for any $u \in X$ and for $x_0 := x_1$,
\[
\operatorname{Regret}(\mathrm{AdaReg}^X_\Phi, f, u) \le \frac{1}{2}\sum_{t=0}^{T}\|u - x_t\|^2_{D_{t+1}} + \frac{1}{2}\min_{H \in \mathbb{S}^d_{++}}\bigl(\langle G_T, H\rangle + \Phi(H) - \Phi(H_1)\bigr).
\]

Proof. By Theorem 6.2.3, we have $\mathrm{AdaReg}^X_\Phi = \mathrm{AdaOMD}^X_{\mathcal{M}(X,\Phi,\varepsilon)}$ and, in particular,
\[
\operatorname{Regret}(\mathrm{AdaReg}^X_\Phi, f, u) = \operatorname{Regret}(\mathrm{AdaOMD}^X_{\mathcal{M}(X,\Phi,\varepsilon)}, f, u), \qquad \forall u \in \mathbb{R}^d.
\]
Thus, it suffices to bound the right-hand side of the above equation. For every $t \in \{1, \dots, T+1\}$, define
\[
r_t := [\mathcal{M}(X,\Phi,\varepsilon)](f_{1:t-1}) \quad\text{and}\quad R_t := \sum_{i=1}^{t} r_i.
\]
By Lemma 6.2.2 we know that $\mathcal{M}(X,\Phi,\varepsilon)$ is a mirror map strategy for $\mathcal{C}$ and that for every $t \in \{1, \dots, T+1\}$ we have $r_t = \tfrac{1}{2}\|\cdot\|^2_{D_t}$ and $R_t = \tfrac{1}{2}\|\cdot\|^2_{H_t^{-1}}$, the latter being $1$-strongly convex w.r.t. $\|\cdot\|_{H_t^{-1}}$ on $\mathbb{R}^d$. By Lemma 3.8.5 we have that the dual norm of $\|\cdot\|_{H_t^{-1}}$ is $\|\cdot\|_{H_t}$. Finally, set $x_0 := x_1$ and let $g_t \in \partial f_t(x_t)$ be as in the definition of $\mathrm{AdaOMD}_{\mathcal{M}(X,\Phi,\varepsilon)}(f)$ for every $t \in [T]$. Then, for every $u \in \mathbb{R}^d$, Theorem 5.4.3 yields

\begin{align*}
\operatorname{Regret}(\mathrm{AdaOMD}^X_{\mathcal{M}(X,\Phi,\varepsilon)}, f, u)
&\le \sum_{t=1}^{T+1} B_{r_t}(u, x_{t-1}) + \frac{1}{2}\sum_{t=1}^{T}\|g_t\|^2_{H_{t+1}}\\
&= \frac{1}{2}\sum_{t=1}^{T+1}\|u - x_{t-1}\|^2_{D_t} + \frac{1}{2}\sum_{t=1}^{T}\|g_t\|^2_{H_{t+1}}.
\end{align*}

Thus, it only remains to show that
\begin{equation}
\sum_{t=1}^{T}\|g_t\|^2_{H_{t+1}} \le \min_{H \in \mathbb{S}^d_{++}}\bigl(\langle G_T, H\rangle + \Phi(H) - \Phi(H_1)\bigr). \tag{6.8}
\end{equation}
Let $t \in [T]$. Note that
\begin{equation}
\|g_t\|^2_{H_{t+1}} = g_t^{\mathsf T} H_{t+1} g_t = \operatorname{Tr}(g_t^{\mathsf T} H_{t+1} g_t) = \operatorname{Tr}(g_t g_t^{\mathsf T} H_{t+1}) = \langle g_t g_t^{\mathsf T}, H_{t+1}\rangle. \tag{6.9}
\end{equation}
Define $\Phi' := \Phi + \varepsilon\operatorname{Tr}(\cdot) + \delta(\cdot \mid \mathbb{S}^d_{++})$. By the definition of $\mathcal{M}(X,\Phi,\varepsilon)$, we have
\[
H_t \in \operatorname*{arg\,min}_{H \in \mathbb{S}^d_{++}}\Bigl(\sum_{i=1}^{t-1}\langle g_i g_i^{\mathsf T}, H\rangle + \varepsilon\langle I, H\rangle + \Phi(H)\Bigr)
= \operatorname*{arg\,min}_{H \in \mathbb{S}^d}\Bigl(\sum_{i=1}^{t-1}\langle g_i g_i^{\mathsf T}, H\rangle + \Phi'(H)\Bigr).
\]
Thus, by setting $\psi_t(H) := \langle g_t g_t^{\mathsf T}, H\rangle$ for every $t \in [T]$ and $H \in \mathbb{S}^d$, we conclude that
\[
H_t = \mathrm{FTRL}_{\Phi'}(\langle\psi_1, \dots, \psi_{t-1}\rangle), \qquad \forall t \in \{1, \dots, T+1\}.
\]
Therefore, the FTL–BTL Lemma (Lemma 4.9.1) together with (6.9) yields, for any $H \in \mathbb{S}^d_{++}$,
\begin{align*}
\sum_{t=1}^{T}\|g_t\|^2_{H_{t+1}}
&= \sum_{t=1}^{T}\langle g_t g_t^{\mathsf T}, H_{t+1}\rangle = \sum_{t=1}^{T}\psi_t(H_{t+1}) && \text{by (6.9),}\\
&\le \Phi'(H) - \Phi'(H_1) + \sum_{t=1}^{T}\psi_t(H) && \text{by Lemma 4.9.1,}\\
&= \Phi(H) - \Phi(H_1) - \varepsilon\operatorname{Tr}(H_1) + \varepsilon\operatorname{Tr}(H) + \sum_{t=1}^{T}\psi_t(H) && \text{by the definition of $\Phi'$,}\\
&= \Phi(H) - \Phi(H_1) - \varepsilon\operatorname{Tr}(H_1) + \Bigl\langle \varepsilon I + \sum_{t=1}^{T} g_t g_t^{\mathsf T}, H\Bigr\rangle && \text{by the definition of $\psi_t$,}\\
&= \Phi(H) - \Phi(H_1) - \varepsilon\operatorname{Tr}(H_1) + \langle G_T, H\rangle && \text{by the definition of $G_T$,}\\
&\le \Phi(H) - \Phi(H_1) + \langle G_T, H\rangle && \text{by Cor.~1.1.2 since $H_1 \succ 0$.}
\end{align*}
Taking the infimum over $H \in \mathbb{S}^d_{++}$ in the last inequality above, which is attained since $\Phi$ is a meta-regularizer, completes the proof of (6.8).

Let us try to understand the regret bound we have just proved for an OCO instance $\mathcal{C} := (X, \mathcal{F})$ in a game of $T \in \mathbb{N}$ rounds against an enemy oracle $\mathrm{ENEMY}$ for $\mathcal{C}$. Let $\Phi$ be a meta-regularizer, $\varepsilon > 0$, and set
\[
(x, f) := \mathrm{OCO}_{\mathcal{C}}(\mathrm{AdaReg}^X_\Phi, \mathrm{ENEMY}, T).
\]
Finally, for each $t \in [T]$ let $g_t \in \partial f_t(x_t)$ be as in the definition of $\mathrm{AdaReg}^X_\Phi(f)$. The second term in the above regret bound, as we have already discussed (see the discussion regarding (6.4) and (6.5)), has a very nice meaning regarding the optimality of the norm which measures the sizes of the subgradients used. Namely, by setting $G_T := \varepsilon I + \sum_{t=1}^{T} g_t g_t^{\mathsf T}$ and letting $H_1$ be as in $\mathrm{AdaReg}^X_\Phi(f)$, we have
\begin{equation}
\min_{H \in \mathbb{S}^d_{++}}\bigl(\langle G_T, H\rangle + \Phi(H) - \Phi(H_1)\bigr)
= \min_{H \in \mathbb{S}^d_{++}}\Bigl(\sum_{t=1}^{T}\|g_t\|^2_H + \varepsilon\operatorname{Tr}(H) + \Phi(H) - \Phi(H_1)\Bigr). \tag{6.10}
\end{equation}
That is, the norm in the regret bound which measures the size of the subgradients is, in some sense, optimal: it is the matrix norm which minimizes the sum of the squared norms of the subgradients plus a regularization term given by $\Phi$ and the trace of the matrix. Hence, when choosing $\Phi$ there is a trade-off between minimizing the norms of the subgradients and minimizing the regularization term $\Phi(H) - \Phi(H_1)$ in (6.10). The terms in which $\Phi$ appears may clutter one's intuition, but when we look at specific choices of $\Phi$, the intuition behind (6.10) is usually stronger. For example, in the regret bound for the AdaGrad algorithm that we study in the next section, this term becomes $\min\bigl\{\sum_{t=1}^{T}\|g_t\|^2_H : H \in \mathbb{S}^d_{++},\ \operatorname{Tr}(H) \le 1\bigr\}$.
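To see (6.10) at work on a specific choice, take again the hypothetical meta-regularizer $\Phi(H) = \operatorname{Tr}(H^{-1})$ from the earlier illustrations (not necessarily one of the instantiations studied later in the text). Its minimizer is $H = G_T^{-1/2}$, and since $H_1 = (\varepsilon I)^{-1/2}$ we get
\[
\min_{H \in \mathbb{S}^d_{++}}\bigl(\langle G_T, H\rangle + \Phi(H)\bigr) - \Phi(H_1)
= 2\operatorname{Tr}\bigl(G_T^{1/2}\bigr) - \operatorname{Tr}\bigl((\varepsilon I)^{1/2}\bigr)
= 2\operatorname{Tr}\Bigl(\bigl(\varepsilon I + \textstyle\sum_{t=1}^{T} g_t g_t^{\mathsf T}\bigr)^{1/2}\Bigr) - d\sqrt{\varepsilon},
\]
so for this choice the second term of the regret bound scales with the trace of the square root of the accumulated outer products of the subgradients.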

Moreover, the first term of the regret bound from Theorem 6.2.4 can be seen as a measure of the stability of the choices of the matrices $H_t \in \mathbb{S}^d_{++}$ made by AdaReg (scaled by the diameter of $X$) for each $t \in \{1, \dots, T+1\}$. To see that, note that if $D_t := H_t^{-1} - [t>1]H_{t-1}^{-1}$ for each $t \in \{1, \dots, T+1\}$ and $x_0 := x_1$, then, for any $u \in X$,
\[
\sum_{t=0}^{T}\|u - x_t\|^2_{D_{t+1}} = \sum_{t=0}^{T}\bigl(\|u - x_t\|^2_{H_{t+1}^{-1}} - [t>0]\|u - x_t\|^2_{H_t^{-1}}\bigr).
\]
If the matrices $H_t$ and $H_{t+1}$ are close to each other for every $t \in [T]$, then the above terms are relatively small. We say "relatively" since this value invariably depends on the diameter $\theta := \sup_{x,u \in X}\|x - u\|_2^2$ of $X$ w.r.t. the $\ell_2$-norm. Thus, when picking a meta-regularizer, one wants to avoid abrupt matrix transitions from one round to another. The next corollary shows a bound in the case where $\theta$ is finite, which makes the first term of the regret bound arguably clearer to interpret: it is the diameter of $X$ times the sum of $\operatorname{Tr}(D_t)$ for $t \in \{1, \dots, T+1\}$. In this case, it is clearer how both the diameter of $X$ and the stability of the choices of the matrices $H_1, \dots, H_{T+1}$ affect this term simultaneously.

Lemma 6.2.5. Let $A \in \mathbb{S}^d_+$ and let $v \in \mathbb{R}^d$. Then
\[
v^{\mathsf T} A v \le \|v\|_2^2\operatorname{Tr}(A).
\]

Proof. By the Cauchy–Schwarz inequality, we have
\[
v^{\mathsf T} A v = \operatorname{Tr}(v^{\mathsf T} A v) = \operatorname{Tr}(v v^{\mathsf T} A) = \langle v v^{\mathsf T}, A\rangle
\le \sqrt{\operatorname{Tr}(v v^{\mathsf T} v v^{\mathsf T})}\sqrt{\operatorname{Tr}(A^2)} = \|v\|_2^2\sqrt{\operatorname{Tr}(A^2)}.
\]
Thus, it only remains to show that $\sqrt{\operatorname{Tr}(A^2)} \le \operatorname{Tr}(A)$. Note that, for any $\alpha, \beta \in \mathbb{R}_+$, we have
\[
\bigl(\sqrt{\alpha} + \sqrt{\beta}\bigr)^2 = \alpha + 2\sqrt{\alpha\beta} + \beta \ge \alpha + \beta
\implies \sqrt{\alpha} + \sqrt{\beta} \ge \sqrt{\alpha + \beta}.
\]
Thus, by a simple induction we get
\[
\Bigl(\sum_{i=1}^{d} u_i\Bigr)^{1/2} \le \sum_{i=1}^{d}\sqrt{u_i}, \qquad \forall u \in \mathbb{R}^d_+.
\]
Moreover, by Corollary 1.1.2 we have $\operatorname{Tr}(A^2) = \mathbf{1}^{\mathsf T}\lambda(A^2)$. Therefore,
\[
\operatorname{Tr}(A^2)^{1/2} = \Bigl(\sum_{i=1}^{d}\lambda_i(A^2)\Bigr)^{1/2} = \Bigl(\sum_{i=1}^{d}\lambda_i(A)^2\Bigr)^{1/2} \le \sum_{i=1}^{d}\lambda_i(A) = \operatorname{Tr}(A).
\]
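A quick numerical check of the lemma (purely illustrative) on random positive semidefinite matrices:

```python
import numpy as np

# Illustrative numerical check of Lemma 6.2.5:
# for PSD A and any v, v^T A v <= ||v||_2^2 * Tr(A).
rng = np.random.default_rng(2)
for _ in range(1000):
    d = int(rng.integers(1, 6))
    B = rng.normal(size=(d, d))
    A = B @ B.T                       # PSD by construction
    v = rng.normal(size=d)
    assert v @ A @ v <= (v @ v) * np.trace(A) + 1e-9
```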

Corollary 6.2.6. Let $\mathcal{C} := (X, \mathcal{F})$ be an OCO instance such that $X$ is a nonempty closed set and such that each $f \in \mathcal{F}$ is a proper closed function which is subdifferentiable on $X$. Let $\varepsilon > 0$ and let $\Phi\colon \mathbb{S}^d \to (-\infty,+\infty]$ be a meta-regularizer. Let $T \in \mathbb{N}$, let $\mathrm{ENEMY}$ be an enemy oracle for $\mathcal{C}$, and define
\[
(x, f) := \mathrm{OCO}_{\mathcal{C}}(\mathrm{AdaReg}^X_\Phi, \mathrm{ENEMY}, T).
\]
Moreover, let $H_1, H_{T+1} \in \mathbb{S}^d_{++}$ be as in the definition of $\mathrm{AdaReg}^X_\Phi(f)$. Finally, let $G_T \in \mathbb{S}^d_{++}$ be as in the definition of $\mathrm{AdaReg}^X_\Phi(f)$ and suppose there is $\theta \in \mathbb{R}_{++}$ such that $\theta \ge \sup\{\|x - u\|_2^2 : x, u \in X\}$. Then, for every $u \in X$ and for $x_0 := x_1$,
\[
\operatorname{Regret}(\mathrm{AdaReg}^X_\Phi, f, u) \le \frac{\theta}{2}\operatorname{Tr}(H_{T+1}^{-1}) + \frac{1}{2}\bigl(\langle G_T, H_{T+1}\rangle + \Phi(H_{T+1}) - \Phi(H_1)\bigr).
\]

Proof. Let $u \in X$ and, for every $t \in \{1, \dots, T+1\}$, let $H_t \in \mathbb{S}^d_{++}$ be as in the definition of $\mathrm{AdaReg}^X_\Phi(f)$ and define $D_t := H_t^{-1} - [t>1]H_{t-1}^{-1}$. By Theorem 6.2.4, we have
\[
\operatorname{Regret}(\mathrm{AdaReg}^X_\Phi, f, u) \le \frac{1}{2}\sum_{t=0}^{T}\|u - x_t\|^2_{D_{t+1}} + \frac{1}{2}\min_{H \in \mathbb{S}^d_{++}}\bigl(\langle G_T, H\rangle + \Phi(H) - \Phi(H_1)\bigr).
\]
By definition we have that $H_{T+1}$ attains the above minimum. Thus, it only remains to bound the term $\sum_{t=0}^{T}\|u - x_t\|^2_{D_{t+1}}$. Note that
\begin{align*}
\sum_{t=0}^{T}\|u - x_t\|^2_{D_{t+1}}
&= \sum_{t=0}^{T}(u - x_t)^{\mathsf T} D_{t+1}(u - x_t)
\overset{\text{Lem.~6.2.5}}{\le} \sum_{t=0}^{T}\|u - x_t\|_2^2\operatorname{Tr}(D_{t+1})\\
&\le \theta\sum_{t=0}^{T}\operatorname{Tr}(D_{t+1}) = \theta\operatorname{Tr}\Bigl(\sum_{t=0}^{T} D_{t+1}\Bigr) = \theta\operatorname{Tr}(H_{T+1}^{-1}).
\end{align*}