
That is, the trace in the regret bound of AdaGrad is close to the square root of the sum of the squared norms of the gradients, where the norm is the best among all norms induced by matrices $H^{-1}$ with $H \in \mathbb{S}^d_{++}$ in the spectraplex $\mathcal{S}_d$, that is, such that $\operatorname{Tr}(H) = 1$. Note that, for any $g \in \mathbb{R}^d$, by setting $\bar{H} := d^{-1}I \in \mathcal{S}_d$ we have
\[
  \lVert g\rVert_{\bar{H}}^2 = \tfrac{1}{d}\, g^{\mathsf{T}} I g = \tfrac{1}{d}\lVert g\rVert_2^2
  \qquad\text{and}\qquad
  \operatorname{Tr}(\bar{H}^{-1}) = d\operatorname{Tr}(I) = d^2. \tag{6.13}
\]
Thus, for $\varepsilon > 0$ small enough (namely, smaller than $d^{-2}$), we conclude that the regret bound from Theorem 6.3.2 is as good as the one from Theorem 6.1.1.
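A quick numerical sanity check of (6.13), written as a sketch under the assumption that $\lVert g\rVert_A^2 = g^{\mathsf{T}} A g$ (the convention used throughout); the dimension and random seed below are arbitrary.

```python
import numpy as np

# Check (6.13): with H_bar := I/d (in the spectraplex, since Tr(I/d) = 1),
# the induced squared norm of g is ||g||_2^2 / d and Tr(H_bar^{-1}) = d^2.
d = 5
rng = np.random.default_rng(0)
g = rng.standard_normal(d)

H_bar = np.eye(d) / d
assert np.isclose(g @ H_bar @ g, (g @ g) / d)
assert np.isclose(np.trace(np.linalg.inv(H_bar)), d ** 2)
```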

In this section, we follow a path analogous to the one from the previous section to derive the Diagonal Adaptive Gradient (DiagAdaGrad) algorithm. Not only that, this allows us to use many of the results from the previous section as stepping stones to derive the results of this section.

Throughout the remainder of this section, we denote by $\mathbb{D}^d := \{\operatorname{Diag}(x) \in \mathbb{S}^d : x \in \mathbb{R}^d\}$ the set of $d \times d$ diagonal matrices. Moreover, we overload the $\operatorname{Diag}$ operator and, for every $A \in \mathbb{S}^d$, define $\operatorname{Diag}(A) := \operatorname{Diag}(\operatorname{diag}(A))$. That is, $\operatorname{Diag}(A)$ is equal to the matrix $A$ but with zeroes on its off-diagonal entries. Let us show the form and some properties of the meta-regularizer we will use to derive the DiagAdaGrad algorithm, proving first a simple lemma about the normal cone and the relative interior of $\mathbb{D}^d$.

Lemma 6.4.1. We have $\operatorname{ri}(\mathbb{D}^d) = \mathbb{D}^d$ and
\[
  N_{\mathbb{D}^d}(\tilde{A}) = \{\,A \in \mathbb{S}^d : \operatorname{diag}(A) = 0\,\}, \qquad \forall \tilde{A} \in \mathbb{D}^d.
\]

Proof. First, note that for any $\tilde{A}, \tilde{B} \in \mathbb{D}^d$ we have $(1-\mu)\tilde{A} + \mu\tilde{B} \in \mathbb{D}^d$ for any $\mu \in \mathbb{R}$. Thus, $\mathbb{D}^d$ is an affine set and $\operatorname{ri}(\mathbb{D}^d) = \mathbb{D}^d$.

Let $\tilde{A} \in \mathbb{D}^d$, let $A \in \mathbb{S}^d$, and define $\bar{a} := \operatorname{diag}(\tilde{A})$. Note that $\langle A, X - \tilde{A}\rangle \leq 0$ for every $X \in \mathbb{D}^d$ if and only if $\langle A, \operatorname{Diag}(x - \bar{a})\rangle \leq 0$ for every $x \in \mathbb{R}^d$. Moreover, $\langle A, \operatorname{Diag}(x - \bar{a})\rangle \leq 0$ for every $x \in \mathbb{R}^d$ if and only if $\operatorname{diag}(A)^{\mathsf{T}}(x - \bar{a}) \leq 0$ for every $x \in \mathbb{R}^d$. That is, $A \in N_{\mathbb{D}^d}(\tilde{A})$ if and only if $\operatorname{diag}(A) \in N_{\mathbb{R}^d}(\bar{a}) = \{0\}$.

Lemma 6.4.2. Let $\eta > 0$, define $f \colon \mathbb{R}^d \to (-\infty,+\infty]$ by
\[
  f(x) := \eta^2 \sum_{i=1}^{d} [x_i \neq 0]\,\frac{1}{x_i}, \qquad \forall x \in \mathbb{R}^d,
\]
and set $\Phi := f^{\mathbb{S}} + \delta(\cdot \mid \mathbb{D}^d)$. Then,
\[
  \Phi(H) = \eta^2 \operatorname{Tr}(H^{-1}) + \delta(H \mid \mathbb{D}^d), \qquad \forall H \in \mathbb{S}^d_{++}, \tag{6.14}
\]
and for every $G \in \mathbb{S}^d_{++}$ the infimum $\inf_{H \in \mathbb{S}^d_{++}}(\langle G, H\rangle + \Phi(H))$ is attained by $\eta(\operatorname{Diag}(G))^{-1/2}$. In particular, for every $\varepsilon > 0$ we have $\mathrm{DiagAdaGrad}_X = \mathrm{AdaReg}_X^{\Phi}$ for every nonempty and closed set $X \subseteq \mathbb{R}^d$.

Proof. Note that (6.14) follows directly from Lemma 6.3.1. Let $G \in \mathbb{S}^d_{++}$. Let us show that
\[
  \{\eta(\operatorname{Diag}(G))^{-1/2}\} = \operatorname*{arg\,min}_{H \in \mathbb{S}^d_{++}} \bigl(\langle G, H\rangle + \Phi(H)\bigr). \tag{6.15}
\]
Since $\Phi$ is proper and infinite outside of $\mathbb{D}^d$, it is clear that the infimum can only be attained by a matrix in $\mathbb{D}^d \cap \mathbb{S}^d_{++}$. Let $\bar{H} \in \mathbb{D}^d \cap \mathbb{S}^d_{++}$ and define $\Phi_{\mathrm{AdaGrad}}(H) := \eta^2 \operatorname{Tr}(H^{-1})$ for every $H \in \mathbb{S}^d_{++}$, so that $\Phi = \Phi_{\mathrm{AdaGrad}} + \delta(\cdot \mid \mathbb{D}^d)$ on $\mathbb{S}^d_{++}$. By Lemma 6.4.1 we have $\operatorname{ri}(\mathbb{D}^d) = \mathbb{D}^d$, and thus $(\operatorname{ri}\mathbb{S}^d_{++}) \cap (\operatorname{ri}(\mathbb{D}^d)) = \mathbb{S}^d_{++} \cap \mathbb{D}^d \neq \emptyset$. Hence, by the formula for the subdifferential of the sum of convex functions from Theorem 3.5.4 together with the differentiability of $\Phi_{\mathrm{AdaGrad}}$ from Lemma 6.3.1,
\[
  \partial\Phi(\bar{H}) = \nabla\Phi_{\mathrm{AdaGrad}}(\bar{H}) + N_{\mathbb{D}^d}(\bar{H}) = -\eta^2 \bar{H}^{-2} + N_{\mathbb{D}^d}(\bar{H}).
\]

Thus, by the optimality conditions from Theorem 3.6.2, $\bar{H}$ attains the infimum in (6.15) if and only if
\[
  -\bigl(G + \partial\Phi(\bar{H})\bigr) \cap N_{\mathbb{S}^d_{++}}(\bar{H}) \neq \emptyset
  \iff
  0 \in G + \partial\Phi(\bar{H}) = G - \eta^2 \bar{H}^{-2} + N_{\mathbb{D}^d}(\bar{H}).
\]

The above holds if and only if there is $A \in N_{\mathbb{D}^d}(\bar{H})$ such that $\eta^2 \bar{H}^{-2} = G + A$. Since $\bar{H} \in \mathbb{D}^d$, we have $\eta^2 \bar{H}^{-2} \in \mathbb{D}^d$ and, therefore, $G + A \in \mathbb{D}^d$. Hence, we have $G + A = \operatorname{Diag}(G + A)$, and $\bar{H}$ attains the infimum in (6.15) if and only if
\[
  \bar{H}^{-2} = \frac{1}{\eta^2}(G + A) = \frac{1}{\eta^2}\operatorname{Diag}(G + A) = \frac{1}{\eta^2}\operatorname{Diag}(G),
\]
where in the last equation we have used that $\operatorname{diag}(A) = 0$ by Lemma 6.4.1. Thus, $\bar{H}$ attains the infimum from (6.15) if and only if $\bar{H} = \eta(\operatorname{Diag}(G))^{-1/2}$, which proves (6.15).
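As a quick sanity check of (6.15), the sketch below (with arbitrary data) compares the claimed minimizer against random diagonal positive definite matrices; since $\Phi$ is infinite off $\mathbb{D}^d$, it suffices to compare over diagonals $h$, for which $\langle G, H\rangle + \eta^2\operatorname{Tr}(H^{-1}) = \sum_i (G_{ii} h_i + \eta^2/h_i)$.

```python
import numpy as np

# Coordinate-wise, G_ii * h_i + eta^2 / h_i is minimized at h_i = eta / sqrt(G_ii),
# i.e. at the diagonal of eta * Diag(G)^{-1/2}; random perturbations never do better.
rng = np.random.default_rng(1)
d, eta = 4, 0.7
A = rng.standard_normal((d, d))
G = A @ A.T + np.eye(d)                      # a generic positive definite G

def objective(h):                            # h = diagonal of H in D^d cap S^d_{++}
    return np.diag(G) @ h + eta ** 2 * np.sum(1.0 / h)

h_star = eta / np.sqrt(np.diag(G))
for _ in range(100):
    h = h_star * np.exp(0.1 * rng.standard_normal(d))
    assert objective(h_star) <= objective(h) + 1e-12
```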

Now let us show that $\Phi$ is a meta-regularizer. Let $T \in \mathbb{N}\setminus\{0\}$ and $g \in (\mathbb{R}^d)^T$. Moreover, let $\varepsilon > 0$ and set $G_{T-1} := \varepsilon I + \sum_{t=1}^{T-1} g_t g_t^{\mathsf{T}}$ and $G_T := G_{T-1} + g_T g_T^{\mathsf{T}}$. Condition (6.3.i) is satisfied by $\Phi$ since, by (6.15), we know that $\inf_{H \in \mathbb{S}^d_{++}}(\langle H, G_T\rangle + \Phi(H))$ is attained by $\eta\operatorname{Diag}(G_T)^{-1/2}$. Set $H_{T+1} := \eta\operatorname{Diag}(G_T)^{-1/2}$ and $H_T := \eta\operatorname{Diag}(G_{T-1})^{-1/2}$. By definition,
\[
  H_{T+1}^{-1} - H_T^{-1} = \frac{1}{\eta}\bigl(\operatorname{Diag}(G_T)^{1/2} - \operatorname{Diag}(G_{T-1})^{1/2}\bigr).
\]
Since $G_T$ and $G_{T-1}$ are positive semidefinite, $\operatorname{diag}(G_T)$ and $\operatorname{diag}(G_{T-1})$ are nonnegative. Moreover, since $G_T - G_{T-1} = g_T g_T^{\mathsf{T}} \succeq 0$, for every $i \in [d]$ we have
\[
  (G_T)_{i,i} = e_i^{\mathsf{T}} G_T e_i \geq e_i^{\mathsf{T}} G_{T-1} e_i = (G_{T-1})_{i,i},
  \quad\text{and hence}\quad
  (G_T)_{i,i}^{1/2} \geq (G_{T-1})_{i,i}^{1/2}.
\]
Since $\eta > 0$, we conclude that $\tfrac{1}{\eta}\operatorname{diag}(G_T)^{1/2} \geq \tfrac{1}{\eta}\operatorname{diag}(G_{T-1})^{1/2}$ entrywise, which proves that $\tfrac{1}{\eta}(\operatorname{Diag}(G_T)^{1/2} - \operatorname{Diag}(G_{T-1})^{1/2})$ is positive semidefinite, that is, $\Phi$ satisfies condition (6.3.ii). This finishes the proof that $\Phi$ is a meta-regularizer.
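A small numerical illustration of condition (6.3.ii) (a sketch with arbitrary data): adding the rank-one term $g_T g_T^{\mathsf{T}}$ can only increase the diagonal of $G$, so the difference of the diagonal square roots is entrywise nonnegative.

```python
import numpy as np

# Diag(G_T)^{1/2} - Diag(G_{T-1})^{1/2} has nonnegative diagonal, hence is PSD.
rng = np.random.default_rng(7)
d, eps = 5, 1e-2
G_prev = eps * np.eye(d) + sum(np.outer(v, v) for v in rng.standard_normal((10, d)))
g_T = rng.standard_normal(d)
G_T = G_prev + np.outer(g_T, g_T)

diff = np.sqrt(np.diag(G_T)) - np.sqrt(np.diag(G_prev))
assert np.all(diff >= 0)
```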

Last but not least, let us show that $\mathrm{DiagAdaGrad}_X = \mathrm{AdaReg}_X^{\Phi}$ for any $\varepsilon > 0$ and any closed and convex set $\emptyset \neq X \subseteq \mathbb{R}^d$ (recall that $\eta > 0$ is already given by the statement). Let $X \subseteq \mathbb{R}^d$ be a closed, convex, and nonempty set, let $\varepsilon > 0$, and let $f := \langle f_1, \ldots, f_T\rangle \in \operatorname{Seq}((-\infty,+\infty]^{\mathbb{R}^d})$ be such that the function $f_t$ is subdifferentiable on $X$ for every $t \in [T]$. Let us show that $\mathrm{DiagAdaGrad}_X(f_{1:t-1}) = \mathrm{AdaReg}_X^{\Phi}(f_{1:t-1})$ by induction on $t \in [T]$. Set $x_1 := \mathrm{AdaReg}_X^{\Phi}(\langle\rangle)$ and let $H_1$ be as in the definition of $\mathrm{AdaReg}_X^{\Phi}(\langle\rangle)$. By (6.15), we know that $H_1 = (\eta/\sqrt{\varepsilon})I$. Thus,
\[
  x_1 \in \operatorname*{arg\,min}_{x \in X} \lVert x\rVert_{H_1^{-1}}
      = \operatorname*{arg\,min}_{x \in X} x^{\mathsf{T}} H_1^{-1} x
      = \operatorname*{arg\,min}_{x \in X} \frac{\sqrt{\varepsilon}}{\eta}\, x^{\mathsf{T}} x
      = \operatorname*{arg\,min}_{x \in X} \lVert x\rVert_2.
\]

Since $\lVert\cdot\rVert_2^2$ is strictly convex (see Lemma 3.9.5), the above minimizer is unique and, thus, we have $x_1 = \mathrm{DiagAdaGrad}_X(\langle\rangle)$. Let $t \in \{2, \ldots, T+1\}$, and let $g_{t-1} \in \mathbb{R}^d$ and $G_{t-1} \in \mathbb{S}^d_{++}$ be as in the definition of $x_t := \mathrm{AdaReg}_X^{\Phi}(f_{1:t-1})$. One may note that $g_1, \ldots, g_{t-1} \in \mathbb{R}^d$ as in the definition of $\mathrm{DiagAdaGrad}_X(f_{1:t-1})$ match $g_1, \ldots, g_{t-1}$ as in the definition of $\mathrm{AdaReg}_X^{\Phi}(f_{1:t-1})$ with a proper choice of well-order on the subdifferentials used by the oracles. In this case, by defining $\tilde{G}_{t-1} := \varepsilon I + \sum_{i=1}^{t-1}\operatorname{Diag}(g_i g_i^{\mathsf{T}})$ as in the definition of $\mathrm{DiagAdaGrad}_X(f_{1:t-1})$, we have $\tilde{G}_{t-1} = \operatorname{Diag}(G_{t-1})$. Finally, let $H_t$ be as in the definition of $\mathrm{AdaReg}_X^{\Phi}(f_{1:t-1})$, set $x_{t-1} := \mathrm{AdaReg}_X^{\Phi}(f_{1:t-2}) = \mathrm{DiagAdaGrad}_X(f_{1:t-2})$ (where the equation holds by the induction hypothesis), and define $\tilde{G} := \operatorname{Diag}(G_{t-1}) = \tilde{G}_{t-1}$. Then,
\[
  x_t = \Pi_X^{H_t^{-1}}\bigl(x_{t-1} - H_t g_{t-1}\bigr)
      \overset{(6.12)}{=} \Pi_X^{\tilde{G}^{1/2}}\bigl(x_{t-1} - \eta\tilde{G}^{-1/2} g_{t-1}\bigr)
      = \mathrm{DiagAdaGrad}_X(f_{1:t-1}).
\]
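The update just derived is easy to implement since all matrices involved are diagonal. The sketch below (illustrative only, not the thesis' formal oracle definitions; the function name and the box-shaped $X$ are assumptions) stores the diagonal of $\tilde{G}$ as a vector; for a box, the projection in a norm induced by a diagonal matrix is simply coordinate-wise clipping, because the squared norm is separable.

```python
import numpy as np

def diag_adagrad_step(x, g, G_diag, eta, lo, hi):
    """One DiagAdaGrad round on the box X = [lo, hi]^d (hypothetical helper)."""
    G_diag = G_diag + g ** 2                 # G_tilde_t = G_tilde_{t-1} + Diag(g g^T)
    y = x - eta * g / np.sqrt(G_diag)        # x_t - eta * G_tilde_t^{-1/2} g
    return np.clip(y, lo, hi), G_diag        # projection onto the box

# Toy run on f_t(x) = ||x - z_t||_2^2 / 2 over X = [-1, 1]^d.
rng = np.random.default_rng(2)
d, T, eps, eta = 3, 50, 1e-3, 1.0
x, G_diag = np.zeros(d), eps * np.ones(d)    # G_tilde_0 = eps * I
for t in range(T):
    z = rng.uniform(-1, 1, size=d)
    g = x - z                                # gradient of f_t at the current point
    x, G_diag = diag_adagrad_step(x, g, G_diag, eta, -1.0, 1.0)
```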

Finally, we are in a position to prove a regret bound for the Diagonal AdaGrad algorithm.

Theorem 6.4.3. Let $\mathcal{C} := (X, \mathcal{F})$ be an OCO instance such that $X$ is a nonempty closed set and such that each $f \in \mathcal{F}$ is a proper closed function which is subdifferentiable on $X$. Let $\varepsilon > 0$, let $T \in \mathbb{N}$, and let $\mathrm{ENEMY}$ be an enemy oracle for $\mathcal{C}$. Suppose there is $\theta \in \mathbb{R}_{++}$ such that $\theta \geq \sup\{\,\lVert u - x\rVert_\infty^2 : u, x \in X\,\}$ and set $\eta := \sqrt{\theta/2}$ for $\mathrm{DiagAdaGrad}_X$. Finally, define
\[
  (x, f) := \mathrm{OCO}_{\mathcal{C}}(\mathrm{DiagAdaGrad}_X, \mathrm{ENEMY}, T)
\]
and let $\tilde{G}_T \in \mathbb{S}^d_{++}$ be as in the definition of $\mathrm{DiagAdaGrad}_X(f)$. Then
\[
  \mathrm{Regret}(\mathrm{DiagAdaGrad}_X, f, X) \leq \sqrt{2\theta}\,\operatorname{Tr}\bigl(\tilde{G}_T^{1/2}\bigr).
\]

Proof. Let $\Phi$ be the meta-regularizer from Lemma 6.4.2 for the given $\eta$, so that $\Phi(H) = \eta^2\operatorname{Tr}(H^{-1}) + \delta(H \mid \mathbb{D}^d)$ for every $H \in \mathbb{S}^d_{++}$ and, by Lemma 6.4.2, $\mathrm{AdaReg}_X^{\Phi} = \mathrm{DiagAdaGrad}_X$ if a proper well-order is equipped to the sets used by $\mathrm{AdaReg}_X^{\Phi}$. In this case, it suffices to bound the regret of $\mathrm{AdaReg}_X^{\Phi}$. For each $t \in \{1, \ldots, T+1\}$, let $H_t, G_{t-1} \in \mathbb{S}^d_{++}$ be as in the definition of $\mathrm{AdaReg}_X^{\Phi}(f)$, set $\tilde{G}_{t-1} := \operatorname{Diag}(G_{t-1})$ (which matches the definition of $\tilde{G}_{t-1}$ in $\mathrm{DiagAdaGrad}_X$), define $D_t := H_t^{-1} - [t > 1]\,H_{t-1}^{-1}$, and let $u \in X$. Thus, by Theorem 6.2.4 with $x_0 := x_1$ we have

\[
  \mathrm{Regret}(\mathrm{AdaReg}_X^{\Phi}, f, u)
  \leq \frac{1}{2}\sum_{t=0}^{T}\lVert u - x_t\rVert_{D_{t+1}}^2
  + \frac{1}{2}\min_{H \in \mathbb{S}^d_{++}}\bigl(\langle G_T, H\rangle + \Phi(H) - \Phi(H_1)\bigr). \tag{6.16}
\]
Let us bound each of the terms on the right-hand side separately. First, let us show that

\[
  \sum_{t=0}^{T}\lVert u - x_t\rVert_{D_{t+1}}^2 \leq \sqrt{2\theta}\,\operatorname{Tr}\bigl(\tilde{G}_T^{1/2}\bigr). \tag{6.17}
\]
By the definition of the matrices $H_1, \ldots, H_{T+1}$ and by Lemma 6.4.2 we have $H_t = \eta\tilde{G}_{t-1}^{-1/2}$ for every $t \in \{1, \ldots, T+1\}$. Thus, $H_t$ and $D_t$ are diagonal matrices with nonnegative diagonal entries (the diagonal of $H_t$ is even positive since $G_{t-1} \succ 0$, and the diagonal of $D_t$ is nonnegative since $\tilde{G}_{t-1} \succeq \tilde{G}_{t-2}$ when $t > 1$). Thus, for every $t \in \{1, \ldots, T+1\}$ and any $v \in \mathbb{R}^d$,

\[
  v^{\mathsf{T}} D_{t+1} v = \sum_{i=1}^{d} v_i^2 (D_{t+1})_{i,i}
  \leq \lVert v\rVert_\infty^2 \sum_{i=1}^{d} (D_{t+1})_{i,i}
  = \lVert v\rVert_\infty^2 \operatorname{Tr}(D_{t+1}). \tag{6.18}
\]
Therefore, using, among other facts, that $H_{T+1} = \eta\tilde{G}_T^{-1/2}$ by Lemma 6.4.2, we have

\begin{align*}
  \sum_{t=0}^{T}\lVert u - x_t\rVert_{D_{t+1}}^2
  &= \sum_{t=0}^{T}(u - x_t)^{\mathsf{T}} D_{t+1}(u - x_t)
  \overset{(6.18)}{\leq} \sum_{t=0}^{T}\lVert u - x_t\rVert_\infty^2\operatorname{Tr}(D_{t+1})\\
  &\leq \theta\sum_{t=0}^{T}\operatorname{Tr}(D_{t+1})
  = \theta\operatorname{Tr}\Bigl(\sum_{t=0}^{T} D_{t+1}\Bigr)
  = \theta\operatorname{Tr}\bigl(H_{T+1}^{-1}\bigr)
  \overset{\text{Lem.~6.4.2}}{=} \frac{\theta}{\eta}\operatorname{Tr}\bigl(\tilde{G}_T^{1/2}\bigr)
  = \sqrt{2\theta}\,\operatorname{Tr}\bigl(\tilde{G}_T^{1/2}\bigr).
\end{align*}

This proves (6.17). Let us now show that
\[
  \min_{H \in \mathbb{S}^d_{++}}\bigl(\langle\tilde{G}_T, H\rangle + \Phi(H) - \Phi(H_1)\bigr)
  \leq \sqrt{2\theta}\,\operatorname{Tr}\bigl(\tilde{G}_T^{1/2}\bigr). \tag{6.19}
\]
Note that the minimum in (6.19) coincides with the one in (6.16), since $\Phi$ is infinite outside of $\mathbb{D}^d$ and $\langle G_T, H\rangle = \langle\operatorname{Diag}(G_T), H\rangle = \langle\tilde{G}_T, H\rangle$ for every $H \in \mathbb{D}^d$. By Lemma 6.4.2, the above minimum is attained by $\eta\tilde{G}_T^{-1/2}$, and $H_1 = (\eta/\sqrt{\varepsilon})I$ since, by definition, $H_1 \in \operatorname*{arg\,min}_{H \in \mathbb{S}^d_{++}}(\langle\tilde{G}_0, H\rangle + \Phi(H))$ and $\tilde{G}_0 = \varepsilon I$. Therefore,
\begin{align*}
  \min_{H \in \mathbb{S}^d_{++}}\bigl(\langle\tilde{G}_T, H\rangle + \Phi(H) - \Phi(H_1)\bigr)
  &= \eta\operatorname{Tr}\bigl(\tilde{G}_T^{1/2}\bigr) + \Phi\bigl(\eta\tilde{G}_T^{-1/2}\bigr) - \Phi\bigl((\eta/\sqrt{\varepsilon})I\bigr)\\
  &= \eta\operatorname{Tr}\bigl(\tilde{G}_T^{1/2}\bigr) + \eta\operatorname{Tr}\bigl(\tilde{G}_T^{1/2}\bigr) - \eta\sqrt{\varepsilon}\operatorname{Tr}(I)\\
  &\leq 2\eta\operatorname{Tr}\bigl(\tilde{G}_T^{1/2}\bigr) = \sqrt{2\theta}\,\operatorname{Tr}\bigl(\tilde{G}_T^{1/2}\bigr),
\end{align*}
which proves (6.19). Plugging (6.17) and (6.19) into the regret bound from (6.16) completes the proof of the statement.
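As an illustration (a sanity check, not a proof), the sketch below runs DiagAdaGrad on random linear losses $f_t(x) = g_t^{\mathsf{T}} x$ over the box $X = [-1,1]^d$, for which $\theta = 4 \geq \sup_{u,x\in X}\lVert u - x\rVert_\infty^2$, and compares the realized regret against $\sqrt{2\theta}\operatorname{Tr}(\tilde{G}_T^{1/2})$. All concrete values are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
d, T, eps = 5, 200, 1e-6
theta = 4.0
eta = np.sqrt(theta / 2.0)

x, G_diag = np.zeros(d), eps * np.ones(d)
loss, grad_sum = 0.0, np.zeros(d)
for t in range(T):
    g = rng.uniform(-1, 1, size=d)                   # enemy picks f_t(x) = g^T x
    loss += g @ x
    grad_sum += g
    G_diag = G_diag + g ** 2                         # diagonal of G_tilde_t
    x = np.clip(x - eta * g / np.sqrt(G_diag), -1.0, 1.0)

best_fixed = -np.abs(grad_sum).sum()                 # min over the box of sum_t g_t^T u
regret = loss - best_fixed
bound = np.sqrt(2 * theta) * np.sqrt(G_diag).sum()   # sqrt(2 theta) * Tr(G_tilde_T^{1/2})
assert regret <= bound
```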

Again, one may find it hard to attach any meaning to the trace in the regret bound of the above theorem. Since the matrices used by $\mathrm{DiagAdaGrad}_X$ are diagonal, there is a simpler formula for this trace. Namely, let $\varepsilon > 0$, let $g \in (\mathbb{R}^d)^T$ for some $T \in \mathbb{N}$, and define $\tilde{G}_T := \varepsilon I + \sum_{t=1}^{T}\operatorname{Diag}(g_t g_t^{\mathsf{T}})$. Then, one may verify that
\[
  \operatorname{Tr}\bigl(\tilde{G}_T^{1/2}\bigr) = \sum_{i=1}^{d}\sqrt{\varepsilon + \sum_{t=1}^{T} g_t(i)^2}.
\]
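A quick numerical check of this formula (a sketch with arbitrary data), computing the matrix square root through an eigendecomposition rather than coordinate-wise so that the identity is actually exercised:

```python
import numpy as np

rng = np.random.default_rng(4)
d, T, eps = 4, 30, 0.5
g = rng.standard_normal((T, d))

G_tilde = eps * np.eye(d) + np.diag((g ** 2).sum(axis=0))
w, V = np.linalg.eigh(G_tilde)                       # G_tilde^{1/2} via eigendecomposition
lhs = np.trace(V @ np.diag(np.sqrt(w)) @ V.T)
rhs = np.sqrt(eps + (g ** 2).sum(axis=0)).sum()
assert np.isclose(lhs, rhs)
```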

Still, this formula may not be very informative. Let us prove a proposition, analogous to its counterpart from the previous section, which sheds some light on the meaning of the above trace.

Proposition 6.4.4. Let $A \in \mathbb{S}^d_{++}\cap\mathbb{D}^d$. Then $\inf\{\operatorname{Tr}(X^{-1}A) : X \in \mathbb{S}^d_{++}\cap\mathbb{D}^d,\ \operatorname{Tr}(X) = 1\}$ is attained by $\operatorname{Tr}(A^{1/2})^{-1}A^{1/2}$.

Proof. Define $a := \operatorname{diag}(A)$. Since $A \in \mathbb{D}^d$, we have $A = \operatorname{Diag}(a)$. Additionally, note that
\begin{align*}
  \inf\{\,\langle X^{-1}, A\rangle : X \in \mathbb{S}^d_{++}\cap\mathbb{D}^d,\ \operatorname{Tr}(X) = 1\,\}
  &= \inf\{\,\langle\operatorname{Diag}(x)^{-1}, \operatorname{Diag}(a)\rangle : x \in \mathbb{R}^d_{++}\cap\Delta_d\,\}\\
  &= \inf\Bigl\{\,\sum_{i=1}^{d}\frac{a_i}{x_i} : x \in \mathbb{R}^d_{++}\cap\Delta_d\,\Bigr\}.
\end{align*}

Not only that, we also have that $\bar{x} \in \mathbb{R}^d_{++}\cap\Delta_d$ attains the last infimum above if and only if $X := \operatorname{Diag}(\bar{x})$ attains the first infimum above. Define $\bar{x} \in \Delta_d$ by
\[
  \bar{x}_i := \frac{a_i^{1/2}}{\sum_{j=1}^{d} a_j^{1/2}}, \qquad \forall i \in [d].
\]
Note that
\[
  \operatorname{Diag}(\bar{x})
  = \frac{1}{\sum_{j=1}^{d} a_j^{1/2}}\operatorname{Diag}(a)^{1/2}
  = \frac{1}{\operatorname{Tr}(\operatorname{Diag}(a)^{1/2})}\operatorname{Diag}(a)^{1/2}
  = \frac{1}{\operatorname{Tr}(A^{1/2})}A^{1/2}.
\]
Thus, to prove the statement, it suffices to show that
\[
  \bar{x} \in \operatorname*{arg\,min}\Bigl\{\,\sum_{i=1}^{d}\frac{a_i}{x_i} : x \in \mathbb{R}^d_{++}\cap\Delta_d\,\Bigr\}. \tag{6.20}
\]

Define the convex function $c \colon \mathbb{R}^d \to (-\infty,+\infty]$ by
\[
  c(x) := \sum_{i=1}^{d}[x_i > 0]\,\frac{a_i}{x_i} + \delta(x \mid \mathbb{R}^d_{++}), \qquad \forall x \in \mathbb{R}^d.
\]
First of all, note that $c$ is closed. Indeed, $c$ is continuous on $\mathbb{R}^d_{++}$ and, for every $y \in \mathbb{R}^d_{+}\setminus\mathbb{R}^d_{++}$,
\[
  \liminf_{x \to y} c(x) = +\infty = c(y).
\]

Moreover, note that
\[
  (\nabla c(x))_i = -\frac{a_i}{x_i^2}, \qquad \forall i \in [d],\ \forall x \in \mathbb{R}^d_{++}.
\]
Thus, for every $x \in \Delta_d$,
\[
  \nabla c(\bar{x})^{\mathsf{T}}(x - \bar{x})
  = -\Bigl(\sum_{i=1}^{d} a_i^{1/2}\Bigr)^{2}\mathbf{1}^{\mathsf{T}}(x - \bar{x}) = 0.
\]
That is, $-\nabla c(\bar{x}) \in N_{\Delta_d}(\bar{x})$. By the optimality conditions for minima of convex functions (see Theorem 3.6.2), this implies that $\bar{x} \in \operatorname*{arg\,min}_{x \in \Delta_d} c(x)$, which is equivalent to (6.20).
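The sketch below (arbitrary data) illustrates Proposition 6.4.4 numerically for a diagonal $A$: the candidate $\bar{X} = A^{1/2}/\operatorname{Tr}(A^{1/2})$ attains the value $\operatorname{Tr}(A^{1/2})^2$, and random feasible diagonal matrices with unit trace never do better.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 6
a = rng.uniform(0.1, 2.0, size=d)                # diagonal of A in S^d_{++} cap D^d

x_bar = np.sqrt(a) / np.sqrt(a).sum()            # diagonal of A^{1/2} / Tr(A^{1/2})
val_bar = np.sum(a / x_bar)                      # Tr(X_bar^{-1} A)
assert np.isclose(val_bar, np.sqrt(a).sum() ** 2)

for _ in range(200):
    x = rng.uniform(0.01, 1.0, size=d)
    x /= x.sum()                                 # a random point of the simplex
    assert np.sum(a / x) >= val_bar - 1e-9
```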

Corollary 6.4.5. Let $\varepsilon > 0$ and $g \in (\mathbb{R}^d)^T$ for some $T \in \mathbb{N}$. Moreover, set $\tilde{G}_T := \varepsilon I + \sum_{t=1}^{T}\operatorname{Diag}(g_t g_t^{\mathsf{T}})$ and $\mathcal{S}_d := \{X \in \mathbb{S}^d_{++} : \operatorname{Tr}(X) = 1\}$. Then
\[
  \operatorname{Tr}\bigl(\tilde{G}_T^{1/2}\bigr)
  = \sqrt{\min_{H \in \mathcal{S}_d\cap\mathbb{D}^d}\Bigl(\varepsilon\operatorname{Tr}(H^{-1}) + \sum_{t=1}^{T}\lVert g_t\rVert_{H^{-1}}^2\Bigr)}.
\]

Proof. Set $\mathcal{H} := \mathcal{S}_d\cap\mathbb{D}^d$. Note that if $g \in \mathbb{R}^d$ and $H \in \mathbb{D}^d$, then $\lVert g\rVert_H^2 = \operatorname{Tr}(\operatorname{Diag}(g)H\operatorname{Diag}(g))$. Using this fact and Proposition 6.4.4, we have
\begin{align*}
  \sqrt{\min_{H \in \mathcal{H}}\Bigl(\varepsilon\operatorname{Tr}(H^{-1}) + \sum_{t=1}^{T}\lVert g_t\rVert_{H^{-1}}^2\Bigr)}
  &= \sqrt{\min_{H \in \mathcal{H}}\Bigl(\varepsilon\langle H^{-1}, I\rangle + \sum_{t=1}^{T}\operatorname{Tr}\bigl(\operatorname{Diag}(g_t)H^{-1}\operatorname{Diag}(g_t)\bigr)\Bigr)}\\
  &= \sqrt{\min_{H \in \mathcal{H}}\Bigl(\varepsilon\langle H^{-1}, I\rangle + \sum_{t=1}^{T}\bigl\langle H^{-1}, \operatorname{Diag}(g_t g_t^{\mathsf{T}})\bigr\rangle\Bigr)}\\
  &= \sqrt{\min_{H \in \mathcal{H}}\langle H^{-1}, \tilde{G}_T\rangle}
  = \sqrt{\operatorname{Tr}\bigl(\tilde{G}_T^{1/2}\bigr)^2}
  = \operatorname{Tr}\bigl(\tilde{G}_T^{1/2}\bigr).
\end{align*}
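A short numerical check of the corollary (a sketch with arbitrary data): evaluating the objective at the minimizer $H = \tilde{G}_T^{1/2}/\operatorname{Tr}(\tilde{G}_T^{1/2})$ given by Proposition 6.4.4 recovers $\operatorname{Tr}(\tilde{G}_T^{1/2})^2$, whose square root is the trace in the regret bound.

```python
import numpy as np

rng = np.random.default_rng(6)
d, T, eps = 4, 25, 0.3
g = rng.standard_normal((T, d))

G_diag = eps + (g ** 2).sum(axis=0)              # diagonal of G_tilde_T
h = np.sqrt(G_diag) / np.sqrt(G_diag).sum()      # diagonal of the minimizing H
objective = eps * np.sum(1.0 / h) + sum(gt @ (gt / h) for gt in g)
assert np.isclose(np.sqrt(objective), np.sqrt(G_diag).sum())
```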

With Corollary 6.4.5 at hand, we can compare the regret bound for Diagonal AdaGrad from Theorem 6.4.3 with the regret bound for the classic AdaGrad algorithm (Theorem 6.3.2) and with the regret bound for the Online Mirror Descent algorithm with adaptive step sizes (Theorem 6.1.1). As expected, the regret bound for AdaGrad seems to be better than the one for its diagonal version. We can see this by comparing Corollaries 6.3.4 and 6.4.5, which give more palatable ways of writing the traces that appear in the bounds of both algorithms: in Corollary 6.3.4 the minimum is taken over all positive definite matrices in the spectraplex, while in the above corollary the search space is restricted to diagonal matrices.

Still, the regret bound for Diagonal AdaGrad seems to be as good as the one for the OMD algorithm with adaptive step sizes from Theorem 6.1.1. To see this, recall from (6.13) that the (scaled) $\ell_2$-norm can be written as the norm induced by $d^{-1}I$, where $d \in \mathbb{N}\setminus\{0\}$ is the dimension of the problem. Thus, for a value of $\varepsilon > 0$ small enough in the above corollary, we conclude that the norm chosen by the above minimum is at least as good as the $\ell_2$-norm if the goal is to minimize the sum of the squared norms of the subgradients. However, one problem appears when trying to compare the regret bound for $\mathrm{DiagAdaGrad}_X$ with the bounds from previous sections: the diameter $\theta \in \mathbb{R}_{++}$ in Theorem 6.4.3 is measured with respect to the $\ell_\infty$-norm, while in previous sections the $\ell_2$-norm was used. Thus, more informative comparisons of these regret bounds require more information on the set $X \subseteq \mathbb{R}^d$ from which the player picks her points.