
which is $(\sqrt{\sigma}/\eta_t)$-strongly convex w.r.t. $\|\cdot\|$. By setting $x_0 := x_1$, Corollary 5.5.2 yields, for every $u \in X$,
\begin{align*}
\operatorname{Regret}(\mathrm{AdaDA}^X_R, f, u)
&\le \sum_{t=1}^{T}\bigl(r_t(u) - r_t(x_t)\bigr) + \frac{1}{2}\sum_{t=1}^{T}\frac{\eta_t}{\sqrt{\sigma}}\|g_t\|^2\\
&= \frac{1}{2}\sum_{t=1}^{T}\Bigl(\frac{1}{\eta_t} - [t>1]\frac{1}{\eta_{t-1}}\Bigr)\frac{1}{\sqrt{\sigma}}\bigl(R(u) - R(x_t)\bigr) + \frac{1}{2\sqrt{\sigma}}\sum_{t=1}^{T}\eta_t\|g_t\|^2\\
&\le \frac{\theta}{2\eta_T\sqrt{\sigma}} + \frac{1}{2\sqrt{\sigma}}\sum_{t=1}^{T}\eta_t\|g_t\|^2\\
&\le \frac{1}{2}\sqrt{\frac{2\theta}{\sigma}\Bigl(\rho + \sum_{t=1}^{T-1}\|g_t\|^2\Bigr)} + \frac{1}{2}\sqrt{\frac{\theta}{2\sigma}}\sum_{t=1}^{T}\frac{\|g_t\|^2}{\sqrt{\rho + \sum_{j=1}^{t-1}\|g_j\|^2}}\\
&\le \frac{1}{2}\sqrt{\frac{2\theta}{\sigma}\Bigl(\rho + \sum_{t=1}^{T-1}\|g_t\|^2\Bigr)} + \frac{1}{2}\sqrt{\frac{\theta}{2\sigma}}\sum_{t=1}^{T}\frac{\|g_t\|^2}{\sqrt{\sum_{j=1}^{t}\|g_j\|^2}}\\
&\overset{\text{Lem.~4.6.2}}{\le} \frac{1}{2}\sqrt{\frac{2\theta}{\sigma}\Bigl(\rho + \sum_{t=1}^{T-1}\|g_t\|^2\Bigr)} + \frac{1}{2}\sqrt{\frac{2\theta}{\sigma}\sum_{t=1}^{T}\|g_t\|^2}\\
&\le \sqrt{\frac{2\theta}{\sigma}\Bigl(\rho + \sum_{t=1}^{T-1}\|g_t\|^2\Bigr)}.
\end{align*}

rank-one matrices based mainly on the subgradients of the previous functions picked by the enemy. In this way, the update rule of the iterate of AdaGrad at round $t \in \mathbb{N}\setminus\{0\}$ is of the form
\begin{equation}
x_t = \Pi^{H_t^{-1}}_X\bigl([t>1](x_{t-1} - H_t g_{t-1})\bigr), \tag{6.2}
\end{equation}
where $X \subseteq \mathbb{R}^d$ is the set from which the player is allowed to pick his points, $x_{t-1}$ is the player's iterate at round $t-1$, $g_{t-1}$ is a subgradient of the enemy's function at round $t-1$, and $\Pi^{H_t^{-1}}_X$ is the projection onto $X$ w.r.t. the norm $\|\cdot\|_{H_t^{-1}}$. Thus, the matrix $H_t$ intuitively skews the subgradient of the previous round in a desirable way, and adjusts the projection to balance the skewed subgradient step.
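For concreteness, a minimal numerical sketch of one update of the form (6.2) is given below. It assumes a diagonal $H_t$ built from the sum of squared past subgradient coordinates (the usual diagonal variant of AdaGrad) and a box-shaped $X$, for which the projection w.r.t. $\|\cdot\|_{H_t^{-1}}$ decomposes coordinate-wise into clipping; both choices are illustrative assumptions and not the exact matrices $H_t$ derived in the text.

```python
import numpy as np

def adagrad_diag_step(x_prev, g_prev, s, lo, hi, eps=1e-8):
    """One update of the form (6.2), under illustrative assumptions.

    Assumptions (not from the text): H_t is diagonal with entries
    1/sqrt(eps + sum of squared past subgradient coordinates), and X is
    the box [lo, hi]^d, so the projection w.r.t. the H_t^{-1}-norm is
    coordinate-wise clipping (the weighted norm is separable).
    """
    s = s + g_prev**2              # running sum of squared coordinates
    h = 1.0 / np.sqrt(eps + s)     # diagonal of H_t
    y = x_prev - h * g_prev        # skewed subgradient step
    x = np.clip(y, lo, hi)         # projection onto the box
    return x, s

# Toy usage with d = 3 and X = [-1, 1]^3.
x, s = np.zeros(3), np.zeros(3)
for g in (np.array([1.0, -2.0, 0.5]), np.array([0.3, 0.1, -1.0])):
    x, s = adagrad_diag_step(x, g, s, lo=-1.0, hi=1.0)
```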

Another algorithm for Online Convex Optimization with a similar update rule is the Online Newton Step (ONS) algorithm [37], which is guaranteed to attain regret with a logarithmic dependence on the number of rounds if the functions played by the enemy are differentiable and exp-concave, a generalization of strong convexity which will be formally described and discussed later. The algorithm's update rule is of the same form as (6.2), with only the choice of the matrix $H_t$ being different, even though it is still a function of the subgradients of the previous choices of the enemy.

In spite of their similarities, AdaGrad and ONS were discovered independently and had unrelated analyses. The authors of [33] proposed the AdaReg algorithm and showed that both AdaGrad and ONS are special cases of AdaReg. This sheds some light on the intuition behind these algorithms. Additionally, it leaves room for the creation of other similar and interesting OCO algorithms. We describe a player oracle which implements the AdaReg algorithm in Algorithm 6.1.

The AdaReg algorithm is parameterized by a function $\Phi\colon \mathbb{S}^d \to (-\infty,+\infty]$, called a meta-regularizer, which dictates which matrices to use in the update (6.2).

Definition 6.2.1 (Meta-regularizer). A function $\Phi\colon \mathbb{S}^d \to (-\infty,+\infty]$ is a meta-regularizer if, for any $G \in \mathbb{S}^d_{++}$,

(6.3.i) the infimum $\inf_{H \in \mathbb{S}^d_{++}}\bigl(\langle G, H\rangle + \Phi(H)\bigr)$ is attained,

(6.3.ii) for any $g \in E$, if
\[
H_T \in \operatorname*{arg\,min}_{H \in \mathbb{S}^d_{++}}\bigl(\langle G, H\rangle + \Phi(H)\bigr)
\quad\text{and}\quad
H_{T+1} \in \operatorname*{arg\,min}_{H \in \mathbb{S}^d_{++}}\bigl(\langle G + gg^{\mathsf T}, H\rangle + \Phi(H)\bigr),
\]
then $H_T \succeq H_{T+1}$ (which implies $H_{T+1}^{-1} \succeq H_T^{-1}$ since $H_T$ and $H_{T+1}$ are positive definite).
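As a hedged illustration of the definition (this specific choice is an assumption made here for concreteness, not necessarily one of the meta-regularizers studied in [33]), consider $\Phi(H) := \operatorname{Tr}(H^{-1})$ on $\mathbb{S}^d_{++}$. The objective in (6.3.i) is then convex on $\mathbb{S}^d_{++}$ and stationarity gives the minimizer in closed form:
\[
\nabla_H\bigl(\langle G, H\rangle + \operatorname{Tr}(H^{-1})\bigr) = G - H^{-2} = 0
\quad\Longleftrightarrow\quad H = G^{-1/2},
\]
so the infimum in (6.3.i) is attained. Moreover, if $G' := G + gg^{\mathsf T}$, then $G \preceq G'$, hence $G^{1/2} \preceq G'^{1/2}$ by operator monotonicity of the square root, and therefore $(G')^{-1/2} \preceq G^{-1/2}$, which is exactly condition (6.3.ii).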

Let us look a little closer at the definition of AdaReg in Algorithm 6.1 for a game with $T \in \mathbb{N}$ rounds and some $\varepsilon > 0$ and, during this discussion, let us look at the reasons for the conditions imposed on meta-regularizers. Let $\mathcal{C} := (X, \mathcal{F})$ be an OCO instance, let $T \in \mathbb{N}$, and let $f \in \mathcal{F}^T$. Moreover, let $t \in \{0, \dots, T-1\}$. At round $t+1$, i.e. when the algorithm is computing $x_{t+1}$, the algorithm builds a positive definite matrix $G_t$, which is the sum of rank-one matrices (based on the subgradients of the enemy's functions) plus$^4$ $\varepsilon I$. Then AdaReg performs its key step: the choice of the matrix $H_{t+1}$ which it uses to perform the "skewed" gradient step as in (6.2). Namely, AdaReg with meta-regularizer $\Phi$ picks $H_{t+1}$ that attains
\begin{equation}
\inf_{H \in \mathbb{S}^d_{++}}\bigl(\langle G_t, H\rangle + \Phi(H)\bigr), \tag{6.4}
\end{equation}

$^4$The main goal of this latter term is to ensure that $G_t$ is invertible, but the value of $\varepsilon > 0$ may affect the guarantees of the algorithms we shall see later on.

Algorithm 6.1 Definition of $\mathrm{AdaReg}^X_\Phi\langle f_1, \dots, f_T\rangle$

Input:
(i) A closed convex set $X \subseteq \mathbb{R}^d$,
(ii) Convex functions $f_1, \dots, f_T \in \mathcal{F}$ for some $T \in \mathbb{N}$ and $\mathcal{F} \subseteq (-\infty,+\infty]^{\mathbb{R}^d}$ such that $f_t$ is subdifferentiable on $X$ for each $t \in [T]$,
(iii) A meta-regularizer $\Phi\colon \mathbb{S}^d \to (-\infty,+\infty]$,
(iv) A real number $\varepsilon > 0$ (usually clear from the context)

Output: $x_{T+1} \in X$

$G_0 \leftarrow \varepsilon I$
Let $H_1 \in \operatorname*{arg\,min}_{H \in \mathbb{S}^d_{++}}\bigl(\langle G_0, H\rangle + \Phi(H)\bigr)$
Let $\{x_1\} \leftarrow \operatorname*{arg\,min}_{x \in X}\|x\|_{H_1^{-1}} = \operatorname*{arg\,min}_{x \in X} x^{\mathsf T} H_1^{-1} x$
for $t = 1$ to $T$ do
    ▷ Computations for round $t+1$
    Compute $g_t \in \partial f_t(x_t)$
    $G_t \leftarrow G_{t-1} + g_t g_t^{\mathsf T}$
    Compute $H_{t+1} \in \operatorname*{arg\,min}_{H \in \mathbb{S}^d_{++}}\bigl(\langle G_t, H\rangle + \Phi(H)\bigr)$
    $x_{t+1} \leftarrow \Pi^{H_{t+1}^{-1}}_X(x_t - H_{t+1} g_t)$
return $x_{T+1}$

where the above infimum is attained by property (6.3.i). Although the above expression can seem cryptic at first, it has a very elegant interpretation. By the definition of the AdaReg oracle, we have $G_t = \varepsilon I + \sum_{i=1}^{t} g_i g_i^{\mathsf T}$, where for each $t \in [T]$ the vector $g_t \in \mathbb{R}^d$ is a subgradient as defined in $\mathrm{AdaReg}^X_\Phi(f_{1:t})$. Thus, for every $H \in \mathbb{S}^d_{++}$ we have
\begin{equation}
\begin{aligned}
\langle G_t, H\rangle + \Phi(H)
&= \sum_{i=1}^{t}\langle g_i g_i^{\mathsf T}, H\rangle + \varepsilon\operatorname{Tr}(H) + \Phi(H)
= \sum_{i=1}^{t}\operatorname{Tr}(g_i g_i^{\mathsf T} H) + \varepsilon\operatorname{Tr}(H) + \Phi(H)\\
&= \sum_{i=1}^{t} g_i^{\mathsf T} H g_i + \varepsilon\operatorname{Tr}(H) + \Phi(H)
= \sum_{i=1}^{t}\|g_i\|_H^2 + \varepsilon\operatorname{Tr}(H) + \Phi(H).
\end{aligned}
\tag{6.5}
\end{equation}
That is, the matrix $H_{t+1}$ is chosen so that the sizes of the subgradients measured by its induced norm are minimized while still not making $\Phi(H) + \varepsilon\operatorname{Tr}(H)$ too high$^5$. Recall that the sum of the squared norms of the subgradients is directly connected to almost all the regret bounds seen in Chapters 4 and 5. Thus, $H_{t+1}$ can be seen roughly as the best matrix with low complexity w.r.t. the meta-regularizer $\Phi$ through which to measure/see the subgradients of the functions played by the enemy so far. Another way to see the choice of $H_{t+1}$, which is the main idea the authors of [33] use in their analysis of AdaReg, is to note that $H_{t+1}$ is the point picked by $\mathrm{FTRL}_{\Phi'}(\langle\psi_1, \dots, \psi_t\rangle)$, where $\psi_i(H) := \langle g_i g_i^{\mathsf T}, H\rangle$ for each $i \in [t]$ and $\Phi' := \Phi + \varepsilon\operatorname{Tr}(\cdot) + \delta(\cdot \mid \mathbb{S}^d_{++})$. That is, the problem of choosing a matrix norm through which to measure the subgradients played by the enemy is seen as a separate OCO instance! In the regret bounds which we prove later in this section it will be clear how well this strategy minimizes the norms of the subgradients.

$^5$The value of $\Phi(H) + \varepsilon\operatorname{Tr}(H)$ here can be interpreted as the "complexity" of the norm $\|\cdot\|_H$.
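To make Algorithm 6.1 and the choice (6.4) concrete, the following minimal numpy sketch instantiates AdaReg with the hypothetical meta-regularizer $\Phi(H) = \operatorname{Tr}(H^{-1})$ from the illustration after Definition 6.2.1 (so that the minimizer of (6.4) is $H_{t+1} = G_t^{-1/2}$) and with $X = \mathbb{R}^d$, so the projection step is the identity. Both choices are assumptions made only for this sketch; they are not the instantiations analyzed later in the text.

```python
import numpy as np

def inv_sqrt(G):
    """Symmetric inverse square root of a positive definite matrix."""
    w, V = np.linalg.eigh(G)
    return (V / np.sqrt(w)) @ V.T

def adareg(subgradient, T, d, eps=1.0):
    """Sketch of Algorithm 6.1 under illustrative assumptions.

    Assumptions not fixed by the text: Phi(H) = Tr(H^{-1}), for which
    arg min_H <G, H> + Phi(H) is H = G^{-1/2}, and X = R^d, so the
    projection onto X is the identity and x_1 = 0.
    `subgradient(t, x)` plays the role of "compute g_t in the
    subdifferential of f_t at x_t".
    """
    G = eps * np.eye(d)          # G_0 = eps * I
    x = np.zeros(d)              # x_1 = arg min_x ||x||_{H_1^{-1}} = 0 on R^d
    for t in range(1, T + 1):
        g = subgradient(t, x)    # g_t
        G = G + np.outer(g, g)   # G_t = G_{t-1} + g_t g_t^T
        H = inv_sqrt(G)          # H_{t+1} minimizes <G_t, H> + Tr(H^{-1})
        x = x - H @ g            # skewed step; projection omitted (X = R^d)
    return x

# Toy usage: f_t(x) = |a_t^T x - b_t|, whose subgradient is easy to write down.
rng = np.random.default_rng(0)
A, b = rng.normal(size=(50, 3)), rng.normal(size=50)
sg = lambda t, x: np.sign(A[t - 1] @ x - b[t - 1]) * A[t - 1]
x_final = adareg(sg, T=50, d=3)
```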

The reader may still be confused about condition (6.3.ii) since, during the above discussion, this condition was never mentioned. Not only that, the AdaReg oracle from Algorithm 6.1 does not seem to need this condition for all of its operations to be well-defined. Indeed, condition (6.3.ii) from the definition of meta-regularizers is not needed for the definition of AdaReg to make sense. However, as we shall soon see, this condition is fundamental for the regret bounds that we derive to hold. Interestingly, even though condition (6.3.ii) is not explicitly stated in [33], all the meta-regularizers the authors use satisfy this condition (which is used explicitly in their proofs).

As one may have noticed, the update in (6.2) closely resembles the update of the Adaptive Online Mirror Descent. Indeed, to bound the regret of AdaReg we will write it as an Adaptive Online Mirror Descent algorithm with a carefully$^6$ crafted mirror map strategy, which we formally define in Algorithm 6.2.

$^6$One may note that we need to ensure that the subgradients used by the mirror map strategy match the ones used by the AdaOMD oracle, and we do so by synchronizing the well-orders used on the subdifferentials by the AdaOMD oracle and by the mirror map strategy. See the discussion following Definition 4.2.1 to recall why we equip well-orders to the subdifferentials used.

Algorithm 6.2 Definition of $\mathcal{M}(X, \Phi, \varepsilon)\langle f_1, \dots, f_T\rangle$

Input:
(i) A closed convex set $X \subseteq \mathbb{R}^d$,
(ii) Convex functions $f_1, \dots, f_T \in \mathcal{F}$ for some $T \in \mathbb{N}$ and $\mathcal{F} \subseteq (-\infty,+\infty]^{\mathbb{R}^d}$ such that $f_t$ is subdifferentiable on $X$ for each $t \in [T]$,
(iii) A meta-regularizer $\Phi\colon \mathbb{S}^d \to (-\infty,+\infty]$,
(iv) A real number $\varepsilon > 0$

Output: A function $r_{T+1}\colon \mathbb{R}^d \to (-\infty,+\infty]$

for $t = 1$ to $T$ do
    ▷ Capture the subgradients used on rounds $1, \dots, T$
    $x_t \leftarrow \mathrm{AdaOMD}^X_{\mathcal{M}(X,\Phi,\varepsilon)}(\langle f_1, \dots, f_{t-1}\rangle)$
    ▷ Equip the right well-order to match the subgradient choice of $\mathrm{AdaOMD}^X_{\mathcal{M}(X,\Phi,\varepsilon)}$
    Equip $\partial f_t(x_t)$ with the same well-order used by $\mathrm{AdaOMD}^X_{\mathcal{M}(X,\Phi,\varepsilon)}$
    Pick $g_t \in \partial f_t(x_t)$
▷ Compute the mirror map increment for round $T+1$
$G_{T-1} \leftarrow \varepsilon I + \sum_{t=1}^{T-1} g_t g_t^{\mathsf T}$
$G_T \leftarrow G_{T-1} + g_T g_T^{\mathsf T}$
Let $H_T \in \operatorname*{arg\,min}_{H \in \mathbb{S}^d_{++}}\bigl(\langle G_{T-1}, H\rangle + \Phi(H)\bigr)$
Let $H_{T+1} \in \operatorname*{arg\,min}_{H \in \mathbb{S}^d_{++}}\bigl(\langle G_T, H\rangle + \Phi(H)\bigr)$
$D_{T+1} \leftarrow H_{T+1}^{-1} - [T > 0]\,H_T^{-1}$
return $x \in \mathbb{R}^d \mapsto \tfrac{1}{2}x^{\mathsf T} D_{T+1} x = \tfrac{1}{2}\|x\|^2_{D_{T+1}}$

Note that if $\Phi$ in the definition of $\mathcal{M}$ in Algorithm 6.2 is a meta-regularizer, then the matrices $D_t$ in the definition of $\mathcal{M}$ are positive semidefinite by condition (6.3.ii). That is, the functions delivered by $\mathcal{M}$ are always convex in this case. This is important if we want to write AdaReg in the form of an AdaOMD algorithm, since in order for $\mathcal{M}$ to be a mirror map strategy (and for us to apply the regret bounds we have proved in Chapter 5), we need the functions it delivers at each round, i.e. the mirror map increments, to be convex. In the following lemma we prove that if we plug into $\mathcal{M}$ a meta-regularizer, then $\mathcal{M}$ is a mirror map strategy and the mirror maps it builds at each round are scaled squared matrix norms.
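A quick numerical sanity check of this positive semidefiniteness, under the same hypothetical meta-regularizer $\Phi(H) = \operatorname{Tr}(H^{-1})$ used in the earlier sketches (so that $H_t^{-1} = G_{t-1}^{1/2}$), is shown below; the PSD property of the increments is exactly what condition (6.3.ii) buys us.

```python
import numpy as np

def sqrtm_sym(G):
    """Symmetric square root via eigendecomposition."""
    w, V = np.linalg.eigh(G)
    return (V * np.sqrt(w)) @ V.T

# Illustrative check: with Phi(H) = Tr(H^{-1}) we have H_t^{-1} = G_{t-1}^{1/2},
# and the increments D_t = H_t^{-1} - H_{t-1}^{-1} should be positive semidefinite.
rng = np.random.default_rng(1)
d, T, eps = 4, 20, 1.0
G = eps * np.eye(d)
H_inv_prev = sqrtm_sym(G)              # H_1^{-1} = G_0^{1/2}
for t in range(1, T + 1):
    g = rng.normal(size=d)
    G = G + np.outer(g, g)             # G_t
    H_inv = sqrtm_sym(G)               # H_{t+1}^{-1} = G_t^{1/2}
    D = H_inv - H_inv_prev             # increment D_{t+1}
    assert np.linalg.eigvalsh(D).min() >= -1e-9, "increment not PSD"
    H_inv_prev = H_inv
```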

Lemma 6.2.2. Let $\mathcal{C} := (X, \mathcal{F})$ be an OCO instance such that $X$ is a closed set and such that each $f \in \mathcal{F}$ is a proper closed function which is subdifferentiable on $X$. Moreover, let $\varepsilon > 0$ and let $\Phi\colon \mathbb{S}^d \to (-\infty,+\infty]$ be a meta-regularizer. Let $T \in \mathbb{N}$, let $f \in \mathcal{F}^T$, and let $H_t \in \mathbb{S}^d_{++}$ and $D_t \in \mathbb{S}^d$ be as defined in $\mathcal{M}(X,\Phi,\varepsilon)(f_{1:t-1})$ for each $t \in \{1, \dots, T+1\}$. Finally, for every $t \in \{1, \dots, T+1\}$ define
\[
r_t := \mathcal{M}(X,\Phi,\varepsilon)(f_{1:t-1}) \quad\text{and}\quad R_t := \sum_{i=1}^{t} r_i.
\]
Then $\mathcal{M}(X,\Phi,\varepsilon)$ is a mirror map strategy for $\mathcal{C}$ which is differentiable on $\mathbb{R}^d$. Moreover, for every $t \in \{1, \dots, T+1\}$ we have $D_t \succeq 0$, $r_t = \tfrac{1}{2}\|\cdot\|^2_{D_t}$, and $R_t = \tfrac{1}{2}\|\cdot\|^2_{H_t^{-1}}$. Finally, $R_t$ is $1$-strongly convex w.r.t. $\|\cdot\|_{H_t^{-1}}$ on $\mathbb{R}^d$ for every $t \in \{1, \dots, T+1\}$.

Proof. Let $t \in \{1, \dots, T+1\}$. First, note that since $\Phi$ is a meta-regularizer, by condition (6.3.ii) we have that $D_t \succeq 0$. Let us now show that
\begin{equation}
r_t = \tfrac{1}{2}\|\cdot\|^2_{D_t} \quad\text{and}\quad R_t = \tfrac{1}{2}\|\cdot\|^2_{H_t^{-1}}. \tag{6.6}
\end{equation}
Note that the form of $r_t$ as in (6.6) holds by the definition of $[\mathcal{M}(X,\Phi,\varepsilon)](f_{1:t-1})$. Moreover, for every $x \in \mathbb{R}^d$ we have
\[
R_t(x) = \sum_{i=1}^{t} r_i(x) = \sum_{i=1}^{t}\tfrac{1}{2}x^{\mathsf T} D_i x
= \tfrac{1}{2}x^{\mathsf T}\Bigl(\sum_{i=1}^{t}\bigl(H_i^{-1} - [i>1]H_{i-1}^{-1}\bigr)\Bigr)x
= \tfrac{1}{2}x^{\mathsf T} H_t^{-1} x.
\]
This proves (6.6). Let us now show that

(6.7) $\mathcal{M}(X,\Phi,\varepsilon)$ is a mirror map strategy for $\mathcal{C}$ which is differentiable on $\mathbb{R}^d$ and such that $R_t$ is $1$-strongly convex w.r.t. $\|\cdot\|_{H_t^{-1}}$ on $E$.

First, note that $r_t$ is twice continuously differentiable (and, thus, closed) with $\nabla^2 r_t(x) = D_t$ for any $x \in \mathbb{R}^d$. Since $D_t \succeq 0$ by the conditions of a meta-regularizer, by Lemma 3.1.1 we conclude that $r_t$ is convex. It only remains to show that $R_t$ is a mirror map for $X$. That is, we need to prove that

(i) $R_t$ is closed, proper, $1$-strongly convex$^7$ on $\mathbb{R}^d$ w.r.t. $\|\cdot\|_{H_t^{-1}}$, and differentiable on $\mathbb{R}^d$,
(ii) $\mathbb{R}^d = \operatorname{int}(\operatorname{dom} R_t)$,
(iii) for any $y \in \mathbb{R}^d$, the infima $\inf_{x \in X} B_{R_t}(x, y)$ and $\inf_{x \in X} R_t(x)$ are attained, and
(iv) $\{\nabla R_t(x) : x \in \mathbb{R}^d\} = \mathbb{R}^d$.

First, note that (ii) clearly holds, and since $\nabla R_t(x) = H_t^{-1}x$ for any $x \in \mathbb{R}^d$, we conclude that (iv) holds since $H_t^{-1}$ is invertible. Moreover, $R_t$ is twice continuously differentiable on $\mathbb{R}^d$, which implies that $R_t$ is proper and closed (in fact, continuous), and since $\nabla^2 R_t(x) = H_t^{-1} \succ 0$ for any $x \in \mathbb{R}^d$, by Lemma 3.1.1 we conclude that $R_t$ is convex. Note that if $R_t$ is strongly convex, then $B_{R_t}(\cdot, y)$ also is for any $y \in \mathbb{R}^d$, and then the infima from (iii) would be attained by Lemma 3.9.14. Thus, it only remains to show that $R_t$ is $1$-strongly convex w.r.t. $\|\cdot\|_{H_t^{-1}}$. To see that$^8$, note that for every $x, y \in \mathbb{R}^d$ we have
\begin{align*}
\tfrac{1}{2}\|x - y\|^2_{H_t^{-1}}
&= \tfrac{1}{2}(x - y)^{\mathsf T} H_t^{-1}(x - y)
= \tfrac{1}{2}x^{\mathsf T} H_t^{-1}x + \tfrac{1}{2}y^{\mathsf T} H_t^{-1}y - x^{\mathsf T} H_t^{-1}y\\
&= \tfrac{1}{2}x^{\mathsf T} H_t^{-1}x - \tfrac{1}{2}y^{\mathsf T} H_t^{-1}y - (H_t^{-1}y)^{\mathsf T}(x - y)\\
&= \tfrac{1}{2}\|x\|^2_{H_t^{-1}} - \tfrac{1}{2}\|y\|^2_{H_t^{-1}} - (H_t^{-1}y)^{\mathsf T}(x - y)\\
&= R_t(x) - R_t(y) - \nabla R_t(y)^{\mathsf T}(x - y).
\end{align*}
By Theorem 3.9.7 we conclude that $R_t$ is $1$-strongly convex w.r.t. $\|\cdot\|_{H_t^{-1}}$, which concludes the proof of (6.7).

$^7$The definition of mirror map requires strict convexity, but recall that strong convexity implies strict convexity by definition.

$^8$An easier way to prove strong convexity of $R_t$ is to note that $\|\cdot\|_{H_t^{-1}}$ is the norm induced by the inner product $(x, y) \in \mathbb{R}^d \times \mathbb{R}^d \mapsto x^{\mathsf T} H_t^{-1}y$ and then use Lemma 3.9.5. However, using direct computations seems less cumbersome in this case.

With the above lemma, we have the guarantee that $\mathcal{M}$ applied to a meta-regularizer and other properly chosen parameters is indeed a mirror map strategy. In the next theorem we prove the main result of this section: if $\mathcal{C} := (X, \mathcal{F})$ is an OCO instance, $\varepsilon > 0$, and $\Phi\colon \mathbb{S}^d \to (-\infty,+\infty]$ is a meta-regularizer, then $\mathrm{AdaReg}^X_\Phi = \mathrm{AdaOMD}^X_{\mathcal{M}(X,\Phi,\varepsilon)}$. This theorem will allow us to derive regret bounds for AdaReg and the Online Newton Step algorithm from the regret bounds we have for AdaOMD.

Theorem 6.2.3. Let $\mathcal{C} := (X, \mathcal{F})$ be an OCO instance such that $X$ is a nonempty closed set and such that each $f \in \mathcal{F}$ is a proper closed function which is subdifferentiable on $X$. Moreover, let $\varepsilon > 0$ and $\Phi\colon \mathbb{S}^d \to (-\infty,+\infty]$ be a meta-regularizer. Finally, suppose that the well-orders$^9$ over the sets used in the definition of $\mathrm{AdaReg}^X_\Phi$ are the same as in the definition of $\mathrm{AdaOMD}^X_{\mathcal{M}(X,\Phi,\varepsilon)}$. Then, for any $T \in \mathbb{N}$ and $f \in \mathcal{F}^T$,
\[
\mathrm{AdaReg}^X_\Phi(f) = \mathrm{AdaOMD}^X_{\mathcal{M}(X,\Phi,\varepsilon)}(f),
\]
and the matrix $H_{T+1} \in \mathbb{S}^d_{++}$ as defined in $\mathrm{AdaReg}^X_\Phi(f)$ is equal to the matrix $H_{T+1} \in \mathbb{S}^d_{++}$ as defined in $\mathcal{M}(X,\Phi,\varepsilon)(f)$.

$^9$This is only a technical assumption to assure that, if we have in both algorithms a statement such as "let $g_t \in \partial f_t(x_t)$", then in both algorithms the element picked from the set $\partial f_t(x_t)$ is the same.

Proof. Let $T \in \mathbb{N}$ and $f \in \mathcal{F}^T$. Moreover, for each $t \in \{1, \dots, T+1\}$ define
\[
R_t := \sum_{i=1}^{t}[\mathcal{M}(X,\Phi,\varepsilon)](f_{1:i-1}) \quad\text{and}\quad x_t := \mathrm{AdaOMD}^X_{\mathcal{M}(X,\Phi,\varepsilon)}(f_{1:t-1}).
\]
Finally, for each $t \in \{1, \dots, T+1\}$ let $H_t \in \mathbb{S}^d_{++}$ be defined as in $[\mathcal{M}(X,\Phi,\varepsilon)](f_{1:t-1})$. Note that for every $t \in \{1, \dots, T+1\}$ the matrix $H_t$ is the same as the one in $\mathrm{AdaReg}^X_\Phi(f)$, by definition and since all the sets used in the definitions of the AdaReg and AdaOMD oracles are the same. Let us prove by induction on $t \in \{1, \dots, T+1\}$ that
\[
x_t = \mathrm{AdaReg}^X_\Phi(f_{1:t-1}), \qquad \forall t \in \{1, \dots, T+1\}.
\]
For $t = 1$, we have $R_1 = \tfrac{1}{2}\|\cdot\|^2_{H_1^{-1}}$. Thus, the definition of $\mathrm{AdaOMD}^X_{\mathcal{M}(X,\Phi,\varepsilon)}$ yields
\[
\{x_1\} = \operatorname*{arg\,min}_{x \in X} R_1(x) = \operatorname*{arg\,min}_{x \in X}\|x\|_{H_1^{-1}} = \{\mathrm{AdaReg}^X_\Phi(\langle\rangle)\}.
\]
Let $t \in \{2, \dots, T+1\}$. By Lemma 6.2.2, we have that $R_t = \tfrac{1}{2}\|\cdot\|^2_{H_t^{-1}}$. With that, we have
\[
y_t := \nabla R_t(x_{t-1}) - g_{t-1} = H_t^{-1}x_{t-1} - g_{t-1}.
\]
By Lemma 3.8.5, the dual norm of $\|\cdot\|_{H_t^{-1}}$ is $\|\cdot\|_{H_t}$, and by Theorem 3.8.2 we have $R_t^* = \tfrac{1}{2}\|\cdot\|^2_{H_t}$. This together with the definition of $\mathrm{AdaOMD}(f_{1:t-1})$ yields
\[
x_t = \Pi^{H_t^{-1}}_X\bigl(\nabla R_t^*(y_t)\bigr)
= \Pi^{H_t^{-1}}_X\bigl(H_t(H_t^{-1}x_{t-1} - g_{t-1})\bigr)
= \Pi^{H_t^{-1}}_X(x_{t-1} - H_t g_{t-1})
= \mathrm{AdaReg}^X_\Phi(f_{1:t-1}).
\]

Finally, let us now show a regret bound for the AdaReg algorithm. The proof of the next regret bound has two key steps. The first is almost obvious given the previous theorem: use the regret bound for AdaOMD that we proved previously (see Theorem 5.4.3). This together with the previous theorem yields a regret bound which is arguably not very useful. The second key step in the next proof is to use the FTL–BTL Lemma on the matrices $H_t$ used by AdaReg to show the optimality, in some sense, of this choice of matrices with respect to the minimization of the norms of the subgradients. This yields a neat regret bound, whose intuition we discuss after proving the next theorem.

Theorem 6.2.4. Let $\mathcal{C} := (X, \mathcal{F})$ be an OCO instance such that $X \subseteq \mathbb{R}^d$ is a nonempty closed set and such that each $f \in \mathcal{F}$ is a proper closed function which is subdifferentiable on $X$. Let $\varepsilon > 0$ and let $\Phi\colon \mathbb{S}^d \to (-\infty,+\infty]$ be a meta-regularizer. Let $T \in \mathbb{N}$, let $\mathrm{ENEMY}$ be an enemy oracle for $\mathcal{C}$, and define
\[
(x, f) := \mathrm{OCO}_{\mathcal{C}}(\mathrm{AdaReg}^X_\Phi, \mathrm{ENEMY}, T).
\]
For each $t \in \{1, \dots, T+1\}$, let $H_t \in \mathbb{S}^d_{++}$ be as in the definition of $\mathrm{AdaReg}^X_\Phi(f_{1:t-1})$ and define $D_t := H_t^{-1} - [t>1]H_{t-1}^{-1}$. Finally, let $G_T \in \mathbb{S}^d_{++}$ be as in the definition of $\mathrm{AdaReg}^X_\Phi(f)$. Then, for any $u \in X$ and for $x_0 := x_1$,
\[
\operatorname{Regret}(\mathrm{AdaReg}^X_\Phi, f, u) \le \frac{1}{2}\sum_{t=0}^{T}\|u - x_t\|^2_{D_{t+1}} + \frac{1}{2}\min_{H \in \mathbb{S}^d_{++}}\bigl(\langle G_T, H\rangle + \Phi(H) - \Phi(H_1)\bigr).
\]

Proof. By Theorem 6.2.3, we have $\mathrm{AdaReg}^X_\Phi = \mathrm{AdaOMD}^X_{\mathcal{M}(X,\Phi,\varepsilon)}$ and, in particular,
\[
\operatorname{Regret}(\mathrm{AdaReg}^X_\Phi, f, u) = \operatorname{Regret}(\mathrm{AdaOMD}^X_{\mathcal{M}(X,\Phi,\varepsilon)}, f, u), \qquad \forall u \in \mathbb{R}^d.
\]
Thus, it suffices to bound the right-hand side of the above equation. For every $t \in \{1, \dots, T+1\}$, define
\[
r_t := [\mathcal{M}(X,\Phi,\varepsilon)](f_{1:t-1}) \quad\text{and}\quad R_t := \sum_{i=1}^{t} r_i.
\]
By Lemma 6.2.2 we know that $\mathcal{M}(X,\Phi,\varepsilon)$ is a mirror map strategy for $\mathcal{C}$ and that for every $t \in \{1, \dots, T+1\}$ we have $r_t = \tfrac{1}{2}\|\cdot\|^2_{D_t}$ and $R_t = \tfrac{1}{2}\|\cdot\|^2_{H_t^{-1}}$, the latter being $1$-strongly convex w.r.t. $\|\cdot\|_{H_t^{-1}}$ on $\mathbb{R}^d$. By Lemma 3.8.5 we have that the dual norm of $\|\cdot\|_{H_t^{-1}}$ is $\|\cdot\|_{H_t}$. Finally, set $x_0 := x_1$ and let $g_t \in \partial f_t(x_t)$ be as in the definition of $\mathrm{AdaOMD}_{\mathcal{M}(X,\Phi,\varepsilon)}(f)$ for every $t \in [T]$. Then, for every $u \in \mathbb{R}^d$, Theorem 5.4.3 yields

\begin{align*}
\operatorname{Regret}(\mathrm{AdaOMD}^X_{\mathcal{M}(X,\Phi,\varepsilon)}, f, u)
&\le \sum_{t=1}^{T+1} B_{r_t}(u, x_{t-1}) + \frac{1}{2}\sum_{t=1}^{T}\|g_t\|^2_{H_{t+1}}\\
&= \frac{1}{2}\sum_{t=1}^{T+1}\|u - x_{t-1}\|^2_{D_t} + \frac{1}{2}\sum_{t=1}^{T}\|g_t\|^2_{H_{t+1}}.
\end{align*}

Thus, it only remains to show that
\begin{equation}
\sum_{t=1}^{T}\|g_t\|^2_{H_{t+1}} \le \min_{H \in \mathbb{S}^d_{++}}\bigl(\langle G_T, H\rangle + \Phi(H) - \Phi(H_1)\bigr). \tag{6.8}
\end{equation}
Let $t \in [T]$. Note that
\begin{equation}
\|g_t\|^2_{H_{t+1}} = g_t^{\mathsf T} H_{t+1} g_t = \operatorname{Tr}(g_t^{\mathsf T} H_{t+1} g_t) = \operatorname{Tr}(g_t g_t^{\mathsf T} H_{t+1}) = \langle g_t g_t^{\mathsf T}, H_{t+1}\rangle. \tag{6.9}
\end{equation}
Define $\Phi' := \Phi + \varepsilon\operatorname{Tr}(\cdot) + \delta(\cdot \mid \mathbb{S}^d_{++})$. By the definition of $\mathcal{M}(X,\Phi,\varepsilon)$, we have
\[
H_t \in \operatorname*{arg\,min}_{H \in \mathbb{S}^d_{++}}\Bigl(\sum_{i=1}^{t-1}\langle g_i g_i^{\mathsf T}, H\rangle + \varepsilon\langle I, H\rangle + \Phi(H)\Bigr)
= \operatorname*{arg\,min}_{H \in \mathbb{S}^d}\Bigl(\sum_{i=1}^{t-1}\langle g_i g_i^{\mathsf T}, H\rangle + \Phi'(H)\Bigr).
\]
Thus, by setting $\psi_t(H) := \langle g_t g_t^{\mathsf T}, H\rangle$ for every $t \in [T]$ and $H \in \mathbb{S}^d$, we conclude that
\[
H_t = \mathrm{FTRL}_{\Phi'}(\langle\psi_1, \dots, \psi_{t-1}\rangle), \qquad \forall t \in \{1, \dots, T+1\}.
\]
Therefore, the FTL–BTL Lemma (Lemma 4.9.1) together with (6.9) yields, for any $H \in \mathbb{S}^d_{++}$,
\begin{align*}
\sum_{t=1}^{T}\|g_t\|^2_{H_{t+1}}
&= \sum_{t=1}^{T}\langle g_t g_t^{\mathsf T}, H_{t+1}\rangle = \sum_{t=1}^{T}\psi_t(H_{t+1}) && \text{by (6.9),}\\
&\le \Phi'(H) - \Phi'(H_1) + \sum_{t=1}^{T}\psi_t(H) && \text{by Lemma 4.9.1,}\\
&= \Phi(H) - \Phi(H_1) - \varepsilon\operatorname{Tr}(H_1) + \varepsilon\operatorname{Tr}(H) + \sum_{t=1}^{T}\psi_t(H) && \text{by the definition of $\Phi'$,}\\
&= \Phi(H) - \Phi(H_1) - \varepsilon\operatorname{Tr}(H_1) + \Bigl\langle \varepsilon I + \sum_{t=1}^{T} g_t g_t^{\mathsf T}, H\Bigr\rangle && \text{by the definition of $\psi_t$,}\\
&= \Phi(H) - \Phi(H_1) - \varepsilon\operatorname{Tr}(H_1) + \langle G_T, H\rangle && \text{by the definition of $G_T$,}\\
&\le \Phi(H) - \Phi(H_1) + \langle G_T, H\rangle && \text{by Cor.~1.1.2 since $H_1 \succ 0$.}
\end{align*}
Taking the infimum over $H \in \mathbb{S}^d_{++}$ in the last inequality above, which is attained since $\Phi$ is a meta-regularizer, completes the proof of (6.8).

Let us try to understand the regret bound we have just proved for an OCO instance $\mathcal{C} := (X, \mathcal{F})$ in a game of $T \in \mathbb{N}$ rounds against an enemy oracle $\mathrm{ENEMY}$ for $\mathcal{C}$. Let $\Phi$ be a meta-regularizer, $\varepsilon > 0$, and set
\[
(x, f) := \mathrm{OCO}_{\mathcal{C}}(\mathrm{AdaReg}^X_\Phi, \mathrm{ENEMY}, T).
\]
Finally, for each $t \in [T]$ let $g_t \in \partial f_t(x_t)$ be as in the definition of $\mathrm{AdaReg}^X_\Phi(f)$. The second term in the above regret bound, as we have already discussed (see the discussion regarding (6.4) and (6.5)), has a very nice meaning regarding the optimality of the norm which measures the sizes of the subgradients used. Namely, by setting $G_T := \varepsilon I + \sum_{t=1}^{T} g_t g_t^{\mathsf T}$ and letting $H_1$ be as in $\mathrm{AdaReg}^X_\Phi(f)$, we have
\begin{equation}
\min_{H \in \mathbb{S}^d_{++}}\bigl(\langle G_T, H\rangle + \Phi(H) - \Phi(H_1)\bigr)
= \min_{H \in \mathbb{S}^d_{++}}\Bigl(\sum_{t=1}^{T}\|g_t\|^2_H + \varepsilon\operatorname{Tr}(H) + \Phi(H) - \Phi(H_1)\Bigr). \tag{6.10}
\end{equation}
That is, the norm in the regret bound which measures the size of the subgradients is, in some sense, optimal: it is the matrix norm which minimizes the sum of the squared norms of the subgradients plus a regularization term given by $\Phi$ and the trace of the matrix. Hence, when choosing $\Phi$ there is a trade-off between minimizing the norms of the subgradients and minimizing the regularization term $\Phi(H) - \Phi(H_1)$ in (6.10). The terms in which $\Phi$ appears may clutter one's intuition, but when we look at specific choices of $\Phi$, the intuition behind (6.10) is usually stronger. For example, in the regret bound for the AdaGrad algorithm that we study in the next section, this term becomes $\min\bigl\{\sum_{t=1}^{T}\|g_t\|^2_H : H \in \mathbb{S}^d_{++},\ \operatorname{Tr}(H) \le 1\bigr\}$.
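To see (6.10) at work on a specific choice, take again the hypothetical meta-regularizer $\Phi(H) = \operatorname{Tr}(H^{-1})$ from the earlier illustrations (not necessarily one of the instantiations studied later in the text). Its minimizer is $H = G_T^{-1/2}$, and since $H_1 = (\varepsilon I)^{-1/2}$ we get
\[
\min_{H \in \mathbb{S}^d_{++}}\bigl(\langle G_T, H\rangle + \Phi(H)\bigr) - \Phi(H_1)
= 2\operatorname{Tr}\bigl(G_T^{1/2}\bigr) - \operatorname{Tr}\bigl((\varepsilon I)^{1/2}\bigr)
= 2\operatorname{Tr}\Bigl(\bigl(\varepsilon I + \textstyle\sum_{t=1}^{T} g_t g_t^{\mathsf T}\bigr)^{1/2}\Bigr) - d\sqrt{\varepsilon},
\]
so for this choice the second term of the regret bound scales with the trace of the square root of the accumulated outer products of the subgradients.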

Moreover, the first term of the regret bound from Theorem 6.2.4 can be seen as a measure of the stability of the choices of the matrices $H_t \in \mathbb{S}^d_{++}$ made by AdaReg (scaled by the diameter of $X$) for each $t \in \{1, \dots, T+1\}$. To see that, note that if $D_t := H_t^{-1} - [t>1]H_{t-1}^{-1}$ for each $t \in \{1, \dots, T+1\}$ and $x_0 := x_1$, then, for any $u \in X$,
\[
\sum_{t=0}^{T}\|u - x_t\|^2_{D_{t+1}} = \sum_{t=0}^{T}\bigl(\|u - x_t\|^2_{H_{t+1}^{-1}} - [t>0]\|u - x_t\|^2_{H_t^{-1}}\bigr).
\]
If the matrices $H_t$ and $H_{t+1}$ are close to each other for every $t \in [T]$, then the above terms are relatively small. We say "relatively" since this value invariably depends on the diameter $\theta := \sup_{x,u \in X}\|x - u\|_2^2$ of $X$ w.r.t. the $\ell_2$-norm. Thus, when picking a meta-regularizer, one wants to avoid abrupt matrix transitions from one round to another. The next corollary shows a bound in the case where $\theta$ is finite, which makes the first term of the regret bound arguably clearer to interpret: it is the diameter of $X$ times the sum of $\operatorname{Tr}(D_t)$ for $t \in \{1, \dots, T+1\}$. In this case, it is clearer how both the diameter of $X$ and the stability of the choices of the matrices $H_1, \dots, H_{T+1}$ affect this term simultaneously.

Lemma 6.2.5. Let $A \in \mathbb{S}^d_+$ and let $v \in \mathbb{R}^d$. Then
\[
v^{\mathsf T} A v \le \|v\|_2^2\operatorname{Tr}(A).
\]

Proof. By the Cauchy–Schwarz inequality, we have
\[
v^{\mathsf T} A v = \operatorname{Tr}(v^{\mathsf T} A v) = \operatorname{Tr}(v v^{\mathsf T} A) = \langle v v^{\mathsf T}, A\rangle
\le \sqrt{\operatorname{Tr}(v v^{\mathsf T} v v^{\mathsf T})}\sqrt{\operatorname{Tr}(A^2)} = \|v\|_2^2\sqrt{\operatorname{Tr}(A^2)}.
\]
Thus, it only remains to show that $\sqrt{\operatorname{Tr}(A^2)} \le \operatorname{Tr}(A)$. Note that, for any $\alpha, \beta \in \mathbb{R}_+$, we have
\[
\bigl(\sqrt{\alpha} + \sqrt{\beta}\bigr)^2 = \alpha + 2\sqrt{\alpha\beta} + \beta \ge \alpha + \beta
\implies \sqrt{\alpha} + \sqrt{\beta} \ge \sqrt{\alpha + \beta}.
\]
Thus, by a simple induction we get
\[
\Bigl(\sum_{i=1}^{d} u_i\Bigr)^{1/2} \le \sum_{i=1}^{d}\sqrt{u_i}, \qquad \forall u \in \mathbb{R}^d_+.
\]
Moreover, by Corollary 1.1.2 we have $\operatorname{Tr}(A^2) = \mathbf{1}^{\mathsf T}\lambda(A^2)$. Therefore,
\[
\operatorname{Tr}(A^2)^{1/2} = \Bigl(\sum_{i=1}^{d}\lambda_i(A^2)\Bigr)^{1/2} = \Bigl(\sum_{i=1}^{d}\lambda_i(A)^2\Bigr)^{1/2} \le \sum_{i=1}^{d}\lambda_i(A) = \operatorname{Tr}(A).
\]
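A quick numerical check of the lemma (purely illustrative) on random positive semidefinite matrices:

```python
import numpy as np

# Illustrative numerical check of Lemma 6.2.5:
# for PSD A and any v, v^T A v <= ||v||_2^2 * Tr(A).
rng = np.random.default_rng(2)
for _ in range(1000):
    d = int(rng.integers(1, 6))
    B = rng.normal(size=(d, d))
    A = B @ B.T                       # PSD by construction
    v = rng.normal(size=d)
    assert v @ A @ v <= (v @ v) * np.trace(A) + 1e-9
```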

Corollary 6.2.6. Let $\mathcal{C} := (X, \mathcal{F})$ be an OCO instance such that $X$ is a nonempty closed set and such that each $f \in \mathcal{F}$ is a proper closed function which is subdifferentiable on $X$. Let $\varepsilon > 0$ and let $\Phi\colon \mathbb{S}^d \to (-\infty,+\infty]$ be a meta-regularizer. Let $T \in \mathbb{N}$, let $\mathrm{ENEMY}$ be an enemy oracle for $\mathcal{C}$, and define
\[
(x, f) := \mathrm{OCO}_{\mathcal{C}}(\mathrm{AdaReg}^X_\Phi, \mathrm{ENEMY}, T).
\]
Moreover, let $H_1, H_{T+1} \in \mathbb{S}^d_{++}$ be as in the definition of $\mathrm{AdaReg}^X_\Phi(f)$. Finally, let $G_T \in \mathbb{S}^d_{++}$ be as in the definition of $\mathrm{AdaReg}^X_\Phi(f)$ and suppose there is $\theta \in \mathbb{R}_{++}$ such that $\theta \ge \sup\{\|x - u\|_2^2 : x, u \in X\}$. Then, for every $u \in X$ and for $x_0 := x_1$,
\[
\operatorname{Regret}(\mathrm{AdaReg}^X_\Phi, f, u) \le \frac{\theta}{2}\operatorname{Tr}(H_{T+1}^{-1}) + \frac{1}{2}\bigl(\langle G_T, H_{T+1}\rangle + \Phi(H_{T+1}) - \Phi(H_1)\bigr).
\]

Proof. Let $u \in X$ and, for every $t \in \{1, \dots, T+1\}$, let $H_t \in \mathbb{S}^d_{++}$ be as in the definition of $\mathrm{AdaReg}^X_\Phi(f)$ and define $D_t := H_t^{-1} - [t>1]H_{t-1}^{-1}$. By Theorem 6.2.4, we have
\[
\operatorname{Regret}(\mathrm{AdaReg}^X_\Phi, f, u) \le \frac{1}{2}\sum_{t=0}^{T}\|u - x_t\|^2_{D_{t+1}} + \frac{1}{2}\min_{H \in \mathbb{S}^d_{++}}\bigl(\langle G_T, H\rangle + \Phi(H) - \Phi(H_1)\bigr).
\]
By definition we have that $H_{T+1}$ attains the above minimum. Thus, it only remains to bound the term $\sum_{t=0}^{T}\|u - x_t\|^2_{D_{t+1}}$. Note that
\begin{align*}
\sum_{t=0}^{T}\|u - x_t\|^2_{D_{t+1}}
&= \sum_{t=0}^{T}(u - x_t)^{\mathsf T} D_{t+1}(u - x_t)
\overset{\text{Lem.~6.2.5}}{\le} \sum_{t=0}^{T}\|u - x_t\|_2^2\operatorname{Tr}(D_{t+1})\\
&\le \theta\sum_{t=0}^{T}\operatorname{Tr}(D_{t+1}) = \theta\operatorname{Tr}\Bigl(\sum_{t=0}^{T} D_{t+1}\Bigr) = \theta\operatorname{Tr}(H_{T+1}^{-1}).
\end{align*}