
Corollary 6.2.6. Let $C := (X, F)$ be an OCO instance such that $X$ is a nonempty closed set and such that each $f \in F$ is a proper closed function which is subdifferentiable on $X$. Let $\varepsilon > 0$ and let $\Phi\colon \mathbb{S}^d \to (-\infty,+\infty]$ be a meta-regularizer. Let $T \in \mathbb{N}$, let $\mathrm{ENEMY}$ be an enemy oracle for $C$, and define
$$(x, f) := \mathrm{OCO}_{C}(\mathrm{AdaReg}_X^{\Phi}, \mathrm{ENEMY}, T).$$
Moreover, let $H_1, H_{T+1} \in \mathbb{S}^d_{++}$ be as in the definition of $\mathrm{AdaReg}_X^{\Phi}(f)$. Finally, let $G_T \in \mathbb{S}^d_{++}$ be as in the definition of $\mathrm{AdaReg}_X^{\Phi}(f)$ and suppose there is $\theta \in \mathbb{R}_{++}$ such that $\theta \geq \sup\{\,\|x - u\|_2^2 : x, u \in X\,\}$.

Then, for every $u \in X$ and for $x_0 := x_1$,
$$\mathrm{Regret}(\mathrm{AdaReg}_X^{\Phi}, f, u) \leq \frac{\theta}{2}\mathrm{Tr}(H_{T+1}^{-1}) + \frac{1}{2}\big(\langle G_T, H_{T+1}\rangle + \Phi(H_{T+1}) - \Phi(H_1)\big).$$

Proof. Let $u \in X$ and, for every $t \in \{1, \ldots, T+1\}$, let $H_t \in \mathbb{S}^d_{++}$ be as in the definition of $\mathrm{AdaReg}_X^{\Phi}(f)$ and define $D_t := H_t^{-1} - [t > 1]H_{t-1}^{-1}$. By Theorem 6.2.4, we have
$$\mathrm{Regret}(\mathrm{AdaReg}_X^{\Phi}, f, u) \leq \frac{1}{2}\sum_{t=0}^{T}\|u - x_t\|_{D_{t+1}}^2 + \frac{1}{2}\min_{H \in \mathbb{S}^d_{++}}\big(\langle G_T, H\rangle + \Phi(H) - \Phi(H_1)\big).$$
By definition, $H_{T+1}$ attains the above minimum. Thus, it only remains to bound the term $\sum_{t=0}^{T}\|u - x_t\|_{D_{t+1}}^2$. Note that

$$\sum_{t=0}^{T}\|u - x_t\|_{D_{t+1}}^2 = \sum_{t=0}^{T}(u - x_t)^{\mathsf T}D_{t+1}(u - x_t) \overset{\text{Lem.\ 6.2.5}}{\leq} \sum_{t=0}^{T}\|u - x_t\|_2^2\,\mathrm{Tr}(D_{t+1}) \leq \theta\sum_{t=0}^{T}\mathrm{Tr}(D_{t+1}) = \theta\,\mathrm{Tr}\Big(\sum_{t=0}^{T}D_{t+1}\Big) = \theta\,\mathrm{Tr}(H_{T+1}^{-1}).$$
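The only nontrivial step in the chain above is the inequality attributed to Lemma 6.2.5, namely that $v^{\mathsf T}Dv \leq \|v\|_2^2\,\mathrm{Tr}(D)$ for positive semidefinite $D$. The following Python/NumPy snippet (an illustration only, not part of the thesis; it merely probes the inequality on random instances) checks it numerically:

import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    d = int(rng.integers(2, 8))
    B = rng.standard_normal((d, d))
    D = B @ B.T                                   # a random positive semidefinite matrix
    v = rng.standard_normal(d)
    # v^T D v <= ||v||_2^2 * Tr(D), since D is dominated by Tr(D)*I when D is PSD
    assert v @ D @ v <= (v @ v) * np.trace(D) + 1e-9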

translating the intuition that all the experts are equal from the perspective of the player. However, note that with the strategy from Section 6.1 the step size at round $t \in [T]$ is $\Theta(1/\sqrt{t-1})$. That is, even though the amount of information revealed by the enemy about each expert is the same after each interval of $d$ rounds, the weights attributed to the experts at the end of each interval are not uniform: they depend on the order in which the experts were penalized. Intuitively this may seem weird, since the order of appearance should not matter much in a game against this enemy oracle.

The AdaGrad algorithm can be interpreted as trying to make the subgradient steps adaptive in a more nuanced fashion. Instead of only adapting the step size based on the norm of the subgradients, at round $t$ the algorithm skews the subgradient with a matrix $H_t$ built from rank-one updates based on the subgradients of previous rounds, and then performs the subgradient step. Finally, we define a player oracle which formally implements the Adaptive Gradient algorithm in Algorithm 6.3.

Algorithm 6.3 Definition of $\mathrm{AdaGrad}_X\langle f_1, \ldots, f_T\rangle$
Input:
(i) A closed convex set $X \subseteq \mathbb{E}$,
(ii) Convex functions $f_1, \ldots, f_T \in F$ for some $T \in \mathbb{N}$ and $F \subseteq (-\infty,+\infty]^{\mathbb{E}}$ such that $f_t$ is subdifferentiable on $X$ for each $t \in [T]$,
(iii) Positive real numbers $\varepsilon > 0$ and $\eta > 0$ (usually clear from the context).
Output: $x_{T+1} \in X$
  $G_0 \leftarrow \varepsilon I$
  $\{x_1\} \leftarrow \arg\min_{x \in X}\|x\|_2$
  for $t = 1$ to $T$ do
    $\triangleright$ Computations for round $t+1$
    Compute $g_t \in \partial f_t(x_t)$
    $G_t \leftarrow G_{t-1} + g_t g_t^{\mathsf T}$
    $x_{t+1} \leftarrow \Pi_X^{G_t^{1/2}}\big(x_t - \eta G_t^{-1/2} g_t\big)$
  return $x_{T+1}$
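For concreteness, the following Python/NumPy sketch mirrors the loop of Algorithm 6.3 under simplifying assumptions that are not part of the algorithm's statement: $X = \mathbb{R}^d$ (so the skewed projection is the identity) and the losses are accessed only through a user-supplied callable returning a subgradient.

import numpy as np

def matrix_power(M, p):
    """M**p for a symmetric positive definite matrix M, via eigendecomposition."""
    w, Q = np.linalg.eigh(M)
    return (Q * w**p) @ Q.T

def adagrad(subgradient, d, T, eps=1e-3, eta=1.0):
    """Sketch of Algorithm 6.3 with X = R^d, so the skewed projection is the identity.

    `subgradient(t, x)` is assumed to return some g_t in the subdifferential of f_t at x.
    """
    G = eps * np.eye(d)                 # G_0 <- eps * I
    x = np.zeros(d)                     # x_1 = argmin over R^d of ||x||_2
    for t in range(1, T + 1):
        g = subgradient(t, x)           # g_t in the subdifferential of f_t at x_t
        G = G + np.outer(g, g)          # G_t <- G_{t-1} + g_t g_t^T
        x = x - eta * matrix_power(G, -0.5) @ g   # skewed subgradient step
        # for a general X, one would project here onto X w.r.t. the norm induced by G^{1/2}
    return x

# Example use (hypothetical loss): 50 rounds of f_t(x) = |x_1 - 1| over R^2.
x_final = adagrad(lambda t, x: np.array([np.sign(x[0] - 1.0), 0.0]), d=2, T=50)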

As we have discussed, the AdaGrad algorithm can be seen as a generalization of the algorithm from Theorem 6.1.1 with the squared $\ell_2$-norm as the mirror map. In Theorem 6.1.1 the algorithm performs, at a round $t \in \mathbb{N}\setminus\{0\}$, a subgradient step with step size $O\big(\big(\sum_{i=1}^{t-1}\|g_i\|_2^2\big)^{-1/2}\big)$, where $g_1, \ldots, g_{t-1} \in \mathbb{R}^d$ are the subgradients (from the enemy's functions) used by the player oracle on past rounds. The AdaGrad algorithm, on the other hand, performs at round $t \in \mathbb{N}\setminus\{0\}$ a step in the direction of the subgradient skewed by a matrix $G_{t-1}^{-1/2}$, where $G_{t-1} \in \mathbb{S}^d_{++}$ is a matrix built from rank-one updates based on the subgradients $g_1, \ldots, g_{t-1} \in \mathbb{R}^d$ of the enemy's past functions (plus a small multiple of the identity to ensure that $G_{t-1}$ is invertible). Additionally, the projection onto the set $X \subseteq \mathbb{R}^d$ from which the player picks its points is also skewed, by the matrix $G_t^{1/2}$.
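To make this contrast concrete, the toy computation below (illustrative only; the gradient sequence is made up) compares the single scalar step size in the spirit of Theorem 6.1.1 with the direction-dependent scaling $G_T^{-1/2}$: a coordinate that is rarely penalized receives a much larger effective step.

import numpy as np

# Hypothetical gradient sequence: the first coordinate is penalized often,
# the second only once and mildly.
gs = [np.array([1.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 0.1])]

eps = 1e-8
G = eps * np.eye(2) + sum(np.outer(g, g) for g in gs)

# Scalar step size in the spirit of Theorem 6.1.1: one scale for all coordinates.
scalar_step = 1.0 / np.sqrt(sum(float(g @ g) for g in gs))

# AdaGrad's skewing matrix G_T^{-1/2}: a different scale per direction.
w, Q = np.linalg.eigh(G)
G_inv_sqrt = (Q * w**-0.5) @ Q.T

print(scalar_step)            # ~0.705
print(np.diag(G_inv_sqrt))    # ~[0.71, 10.0]: the rarely penalized direction moves more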

To derive a regret bound for the above algorithm, we will show that the AdaGrad algorithm is equivalent to the AdaReg algorithm with a special (and simple) choice of meta-regularizer. An instructive way to define our meta-regularizer $\Phi$ is to define its behavior on the eigenvalues of the matrices$^{10}$ given to $\Phi$ as input, and then to see which known matrix operation this yields (if any).

In this way, the analysis of the algorithm is greatly simplified since, for convex functions applied to eigenvalues, we have tools to compute subgradients and, thus, check optimality conditions, as we have seen in Section 3.7.

$^{10}$Namely, given a matrix $X \in \mathbb{S}^d$ we will apply a symmetric convex function $f\colon \mathbb{R}^d \to (-\infty,+\infty]$ on $\lambda(X)$ and then build a new matrix $X' \in \mathbb{S}^d$ from $f(\lambda(X))$. For details, see Section 3.7.

In the next lemma we look at the form and at some properties of the meta-regularizer which yields AdaGrad.

Lemma 6.3.1. Let $\eta > 0$, define $f\colon \mathbb{R}^d \to (-\infty,+\infty]$ by
$$f(x) := \eta^2\sum_{i=1}^{d}[x_i > 0]\frac{1}{x_i} + \delta(x \mid \mathbb{R}^d_{++}), \qquad \forall x \in \mathbb{R}^d,$$
and set $\Phi := f^{\mathbb{S}}$. Then $f$ is a proper closed convex function,
$$\Phi(H) = \eta^2\,\mathrm{Tr}(H^{-1}) \quad\text{and}\quad \nabla\Phi(H) = -\eta^2 H^{-2} \qquad \forall H \in \mathbb{S}^d_{++}. \tag{6.11}$$
Additionally, for every $G \in \mathbb{S}^d_{++}$ the infimum $\inf_{H \in \mathbb{S}^d_{++}}(\langle G, H\rangle + \Phi(H))$ is attained by $\eta G^{-1/2}$. Moreover, $\Phi$ is a meta-regularizer and, for every $\varepsilon > 0$ and for a certain well-order over the sets used by the oracles $\mathrm{AdaGrad}_X$ and $\mathrm{AdaReg}_X^{\Phi}$, we have $\mathrm{AdaGrad}_X = \mathrm{AdaReg}_X^{\Phi}$ for every nonempty closed and convex set $X \subseteq \mathbb{R}^d$.

Proof. First, let us verify that $f$ is a proper closed convex function. First of all, it is clear that $f$ is proper. Define $\varphi(\alpha) := [\alpha > 0]\alpha^{-1} + \delta(\alpha \mid \mathbb{R}_{++})$. Since $\varphi''(\alpha) = 2\alpha^{-3} > 0$ for every $\alpha \in \mathbb{R}_{++}$, by Lemma 3.1.1 we conclude that $\varphi$ is convex. Since $f(x) = \eta^2\sum_{i=1}^{d}\varphi(x_i)$ for every $x \in \mathbb{R}^d$, we conclude that $f$ is convex. Finally, since $\lim_{\alpha \to 0}\varphi(\alpha) = +\infty$ and $\varphi$ is positive throughout $\mathbb{R}$, we conclude that $\liminf_{x \to \bar x} f(x) = +\infty = f(\bar x)$ for any $\bar x \in \mathbb{R}^d_{+}\setminus\mathbb{R}^d_{++}$. Therefore, $f$ is closed.

Let $H \in \mathbb{S}^d_{++}$, set $\lambda := \lambda(H)$, and set $\Lambda := \mathrm{Diag}(\lambda)$. Let us first show that (6.11) holds. By the Spectral Decomposition Theorem (Theorem 1.1.1), there is an orthogonal matrix $Q \in \mathbb{R}^{d \times d}$ such that $H = Q\Lambda Q^{\mathsf T}$. Since $H \succ 0$, by Theorem 1.1.3 we know that $\lambda > 0$. Hence, $\Lambda$ is invertible with $(\Lambda^{-1})_{i,j} = [i = j]\lambda_i^{-1}$ for every $i, j \in [d]$. Hence,
$$H Q\Lambda^{-1}Q^{\mathsf T} = Q\Lambda Q^{\mathsf T} Q\Lambda^{-1}Q^{\mathsf T} = Q\Lambda\Lambda^{-1}Q^{\mathsf T} = QQ^{\mathsf T} = I,$$
that is, $Q\Lambda^{-1}Q^{\mathsf T} = H^{-1}$. Finally, we have

$$\Phi(H) = f(\lambda) = \eta^2\sum_{i=1}^{d}\lambda_i^{-1} = \eta^2\,\mathrm{Tr}(\Lambda^{-1}) = \eta^2\,\mathrm{Tr}(\Lambda^{-1}Q^{\mathsf T}Q) = \eta^2\,\mathrm{Tr}(Q\Lambda^{-1}Q^{\mathsf T}) = \eta^2\,\mathrm{Tr}(H^{-1}).$$
Moreover, note that $\nabla f(\lambda)_i = -\eta^2\lambda_i^{-2}$ for every $i \in [d]$. Hence, $\mathrm{Diag}(\nabla f(\lambda)) = -\eta^2\Lambda^{-2}$, and by Corollary 3.7.5 we have
$$\nabla\Phi(H) = Q\,\mathrm{Diag}(\nabla f(\lambda))\,Q^{\mathsf T} = -\eta^2 Q\Lambda^{-2}Q^{\mathsf T} = -\eta^2\big(Q\Lambda^{-1}Q^{\mathsf T}\big)^2 = -\eta^2 H^{-2}.$$
This proves (6.11). Let $G \in \mathbb{S}^d_{++}$. Let us now show that

$$\{\eta G^{-1/2}\} = \arg\min_{H \in \mathbb{S}^d_{++}}\big(\langle G, H\rangle + \Phi(H)\big). \tag{6.12}$$

Let $\hat H \in \mathbb{S}^d_{++}$. Since $\operatorname{dom} f = \mathbb{R}^d_{++}$, we have $\operatorname{ri}(\operatorname{dom}\Phi) = \mathbb{S}^d_{++}$, and since $\mathbb{S}^d_{++}$ is open, $\hat H$ lies in its interior and thus $N_{\mathbb{S}^d_{++}}(\hat H) = \{0\}$. Therefore, $\mathbb{S}^d_{++} \cap \operatorname{ri}(\operatorname{dom}\Phi)$ is nonempty, and by Theorem 3.6.2 we have
$$\hat H \in \arg\min_{H \in \mathbb{S}^d_{++}}\big(\langle G, H\rangle + \Phi(H)\big) \iff G + \nabla\Phi(\hat H) = 0 \overset{(6.11)}{\iff} \eta^2\hat H^{-2} = G \iff (\hat H^{-1})^2 = \frac{1}{\eta^2}G \overset{\text{Prop.\ 1.1.4}}{\iff} \hat H^{-1} = \frac{1}{\eta}G^{1/2} \iff \hat H = \eta G^{-1/2}.$$
This finishes the proof of (6.12).

Now let us show that $\Phi$ is a meta-regularizer. Let $T \in \mathbb{N}$ and $g \in (\mathbb{R}^d)^T$. Moreover, let $\varepsilon > 0$ and set $G_{T-1} := \varepsilon I + \sum_{t=1}^{T-1} g_t g_t^{\mathsf T}$ and $G_T := G_{T-1} + g_T g_T^{\mathsf T}$. Condition (6.3.i) of a meta-regularizer is satisfied by $\Phi$ since, by (6.12), we know that $\inf_{H \in \mathbb{S}^d_{++}}(\langle H, G_T\rangle + \Phi(H))$ is attained by $\eta G_T^{-1/2}$. Thus, set $H_{T+1} := \eta G_T^{-1/2}$ and $H_T := \eta G_{T-1}^{-1/2}$. Note that
$$H_{T+1}^{-1} - H_T^{-1} = \frac{1}{\eta}\big(G_T^{1/2} - G_{T-1}^{1/2}\big).$$
Since $\eta > 0$ and $G_T - G_{T-1} = g_T g_T^{\mathsf T} \succeq 0$, by Lemma 1.1.5 we have that $\frac{1}{\eta}(G_T^{1/2} - G_{T-1}^{1/2}) \succeq 0$. That is, $\Phi$ satisfies condition (6.3.ii), which completes the proof that $\Phi$ is a meta-regularizer.

Last but not least, let us show that $\mathrm{AdaGrad}_X = \mathrm{AdaReg}_X^{\Phi}$ for any nonempty closed and convex set $X \subseteq \mathbb{R}^d$ and any $\varepsilon > 0$ (recall that we already have $\eta > 0$ from the statement of the lemma). Let $X \subseteq \mathbb{R}^d$ be a nonempty closed and convex set and let $\varepsilon > 0$. Moreover, let $f := \langle f_1, \ldots, f_T\rangle \in \mathrm{Seq}\big((-\infty,+\infty]^{\mathbb{R}^d}\big)$ be such that $f_t$ is subdifferentiable on $X$ for every $t \in [T]$. Let us show that $\mathrm{AdaGrad}_X(f_{1:t-1}) = \mathrm{AdaReg}_X^{\Phi}(f_{1:t-1})$ by induction on $t \in [T+1]$. Set $x_1 := \mathrm{AdaReg}_X^{\Phi}(\langle\rangle)$ and let $H_1 \in \mathbb{S}^d_{++}$ be as in the definition of $\mathrm{AdaReg}_X^{\Phi}(\langle\rangle)$. By (6.12), we know that $H_1 = (\eta/\sqrt{\varepsilon})I$. Thus,
$$x_1 \in \arg\min_{x \in X}\|x\|_{H_1^{-1}} = \arg\min_{x \in X} x^{\mathsf T}H_1^{-1}x = \arg\min_{x \in X}\frac{\sqrt{\varepsilon}}{\eta}\,x^{\mathsf T}x = \arg\min_{x \in X}\|x\|_2^2 = \arg\min_{x \in X}\|x\|_2.$$
Since the squared $\ell_2$-norm is strongly convex (by Lemma 3.9.5), $x_1$ is the unique point that attains the above minima. Thus, $x_1 = \mathrm{AdaGrad}_X(\langle\rangle)$. Let $t \in \{2, \ldots, T+1\}$, and let $g_{t-1} \in \mathbb{R}^d$ and $G_{t-1} \in \mathbb{S}^d_{++}$ be as in the definition of $x_t := \mathrm{AdaReg}_X^{\Phi}(f_{1:t-1})$ (which are equal to $g_{t-1}$ and $G_{t-1}$ in the definition of $\mathrm{AdaGrad}_X(f_{1:t-1})$ with a proper choice of well-order on the subdifferentials used). Finally, let $H_t \in \mathbb{S}^d_{++}$ be as in the definition of $\mathrm{AdaReg}_X^{\Phi}(f_{1:t-1})$ and set $x_{t-1} := \mathrm{AdaReg}_X^{\Phi}(f_{1:t-2}) = \mathrm{AdaGrad}_X(f_{1:t-2})$. Then,
$$x_t = \Pi_X^{H_t^{-1}}\big(x_{t-1} - H_t g_{t-1}\big) \overset{(6.12)}{=} \Pi_X^{G_{t-1}^{1/2}}\big(x_{t-1} - \eta G_{t-1}^{-1/2} g_{t-1}\big) = \mathrm{AdaGrad}_X(f_{1:t-1}).$$
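As a quick numerical sanity check of (6.11) and (6.12) (a sketch, not part of the proof; it only compares the candidate minimizer against randomly sampled positive definite matrices), one can verify that $\eta G^{-1/2}$ achieves a value of $\langle G, H\rangle + \eta^2\mathrm{Tr}(H^{-1})$ no larger than that of random points of $\mathbb{S}^d_{++}$:

import numpy as np

rng = np.random.default_rng(1)
d, eta = 4, 0.7

def mpow(M, p):
    w, Q = np.linalg.eigh(M)
    return (Q * w**p) @ Q.T

A = rng.standard_normal((d, d))
G = A @ A.T + np.eye(d)                      # a generic element of S^d_{++}

def objective(H):
    # <G, H> + Phi(H) with Phi(H) = eta^2 * Tr(H^{-1})
    return np.trace(G @ H) + eta**2 * np.trace(np.linalg.inv(H))

best = objective(eta * mpow(G, -0.5))        # candidate minimizer from (6.12)
for _ in range(1000):
    B = rng.standard_normal((d, d))
    H = B @ B.T + 1e-3 * np.eye(d)           # a random positive definite matrix
    assert objective(H) >= best - 1e-9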

Now that we know which meta-regularizer to use to write AdaGrad as AdaReg, we can apply the results from Section 6.2 to obtain regret bounds for AdaGrad.

Theorem 6.3.2. Let $C := (X, F)$ be an OCO instance such that $X \subseteq \mathbb{R}^d$ is a nonempty closed set and such that each $f \in F$ is a proper closed function which is subdifferentiable on $X$. Let$^{11}$ $\varepsilon > 0$, let $T \in \mathbb{N}$, let $\mathrm{ENEMY}$ be an enemy oracle for $C$, and define
$$(x, f) := \mathrm{OCO}_{C}(\mathrm{AdaGrad}_X, \mathrm{ENEMY}, T).$$
Moreover, let $G_T \in \mathbb{S}^d_{++}$ be as in the definition of $\mathrm{AdaGrad}_X(f)$, suppose there is $\theta \in \mathbb{R}_{++}$ such that $\theta \geq \sup\{\,\|u - x\|_2^2 : u, x \in X\,\}$, and set $\eta := \sqrt{\theta/2}$. Then,
$$\mathrm{Regret}(\mathrm{AdaGrad}_X, f, X) \leq \sqrt{2\theta}\,\mathrm{Tr}(G_T^{1/2}).$$

$^{11}$This $\varepsilon$ is needed to define AdaGrad, although it does not appear in the regret bound.

Proof. Define $f\colon \mathbb{R}^d \to (-\infty,+\infty]$ by $f(x) := \eta^2\sum_{i=1}^{d}[x_i \neq 0]\frac{1}{x_i}$ for each $x \in \mathbb{R}^d$, and set $\Phi := f^{\mathbb{S}}$. By Lemma 6.3.1, we have $\Phi(H) = \eta^2\,\mathrm{Tr}(H^{-1})$ for any $H \in \mathbb{S}^d_{++}$ and $\mathrm{AdaReg}_X^{\Phi} = \mathrm{AdaGrad}_X$. Thus, we only need to bound the regret of $\mathrm{AdaReg}_X^{\Phi}$. Let $H_1, H_{T+1} \in \mathbb{S}^d_{++}$ be as in the definition of $\mathrm{AdaReg}_X^{\Phi}(f)$. By Lemma 6.3.1 together with the definitions of $H_1$ and $H_{T+1}$ we have
$$H_1 = \eta(\varepsilon I)^{-1/2} = \frac{\eta}{\sqrt{\varepsilon}}I \quad\text{and}\quad H_{T+1} = \eta G_T^{-1/2}.$$
Thus, by Corollary 6.2.6 we have, for every $u \in X$,
$$\begin{aligned}
\mathrm{Regret}(\mathrm{AdaReg}_X^{\Phi}, f, u) &\leq \frac{\theta}{2}\mathrm{Tr}(H_{T+1}^{-1}) + \frac{1}{2}\big(\langle G_T, H_{T+1}\rangle + \Phi(H_{T+1}) - \Phi(H_1)\big)\\
&= \frac{\theta}{2\eta}\mathrm{Tr}(G_T^{1/2}) + \frac{1}{2}\Big(\eta\,\mathrm{Tr}(G_T^{1/2}) + \Phi(\eta G_T^{-1/2}) - \Phi\big(\tfrac{\eta}{\sqrt{\varepsilon}}I\big)\Big)\\
&= \frac{\theta}{2\eta}\mathrm{Tr}(G_T^{1/2}) + \frac{1}{2}\Big(\eta\,\mathrm{Tr}(G_T^{1/2}) + \eta\,\mathrm{Tr}(G_T^{1/2}) - \eta\sqrt{\varepsilon}\,\mathrm{Tr}(I)\Big)\\
&\leq \frac{\theta}{2\eta}\mathrm{Tr}(G_T^{1/2}) + \eta\,\mathrm{Tr}(G_T^{1/2})\\
&= \sqrt{\tfrac{\theta}{2}}\,\mathrm{Tr}(G_T^{1/2}) + \sqrt{\tfrac{\theta}{2}}\,\mathrm{Tr}(G_T^{1/2}) = \sqrt{2\theta}\,\mathrm{Tr}(G_T^{1/2}).
\end{aligned}$$
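The chain of (in)equalities in the proof can also be checked numerically. The sketch below (an illustration with random data, not part of the thesis) evaluates the right-hand side of Corollary 6.2.6 at $H_1 = (\eta/\sqrt{\varepsilon})I$ and $H_{T+1} = \eta G_T^{-1/2}$ with $\eta = \sqrt{\theta/2}$ and confirms it is at most $\sqrt{2\theta}\,\mathrm{Tr}(G_T^{1/2})$:

import numpy as np

rng = np.random.default_rng(2)
d, T, theta, eps = 4, 30, 2.0, 1e-3
eta = np.sqrt(theta / 2)

gs = [rng.standard_normal(d) for _ in range(T)]
G = eps * np.eye(d) + sum(np.outer(g, g) for g in gs)
w, Q = np.linalg.eigh(G)
G_sqrt = (Q * np.sqrt(w)) @ Q.T
G_inv_sqrt = (Q * w**-0.5) @ Q.T

def Phi(H):
    return eta**2 * np.trace(np.linalg.inv(H))

H_1 = (eta / np.sqrt(eps)) * np.eye(d)        # H_1 from Lemma 6.3.1
H_T1 = eta * G_inv_sqrt                       # H_{T+1}

rhs = (theta / 2) * np.trace(np.linalg.inv(H_T1)) \
    + 0.5 * (np.trace(G @ H_T1) + Phi(H_T1) - Phi(H_1))
assert rhs <= np.sqrt(2 * theta) * np.trace(G_sqrt) + 1e-9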

One may have noticed that the value of $\varepsilon$ is free to be as small as we want in the above result. This suggests that this parameter may not be needed after all. However, if $\varepsilon = 0$, then the matrices $G_t$ in the definition of AdaGrad are not necessarily invertible anymore. To solve this, one could use the Moore–Penrose pseudo-inverse instead of the inverse of the matrices. For the sake of brevity, we have chosen to describe only the case where all matrices are invertible (that is, the case $\varepsilon > 0$).

One problem with the regret bound from Theorem 6.3.2 is that it is hard to interpret how good it is. The intuitive meaning of $\mathrm{Tr}(G_T^{1/2})$ is not clear, where $G_T \in \mathbb{S}^d_{++}$ is defined as in Theorem 6.3.2. We know that $\mathrm{Tr}(G_T)$ is the sum of the squared $\ell_2$-norms of the subgradients (plus $\varepsilon d$, where $d \in \mathbb{N}$ is the dimension of the problem), but interpreting $\mathrm{Tr}(G_T^{1/2})$ is considerably harder. The next proposition sheds some light on the meaning of the above regret bound and shows how it may be as good as the one from Section 6.1, for example.

Proposition 6.3.3 ([31, Lemma 15]). Let $A \in \mathbb{S}^d_{++}$. Then $\inf\{\mathrm{Tr}(X^{-1}A) : X \in \mathbb{S}^d_{++},\ \mathrm{Tr}(X) = 1\}$ is attained by $A^{1/2}/\mathrm{Tr}(A^{1/2})$.
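A quick numerical illustration of Proposition 6.3.3 (a sketch only; it probes the feasible set with random trace-normalized positive definite matrices and checks the claimed optimal value $\mathrm{Tr}(A^{1/2})^2$):

import numpy as np

rng = np.random.default_rng(3)
d = 5

def mpow(M, p):
    w, Q = np.linalg.eigh(M)
    return (Q * w**p) @ Q.T

A = rng.standard_normal((d, d))
A = A @ A.T + np.eye(d)                       # A in S^d_{++}

A_sqrt = mpow(A, 0.5)
X_star = A_sqrt / np.trace(A_sqrt)            # claimed minimizer
opt = np.trace(np.linalg.inv(X_star) @ A)     # equals Tr(A^{1/2})^2
assert np.isclose(opt, np.trace(A_sqrt) ** 2)

for _ in range(1000):
    B = rng.standard_normal((d, d))
    X = B @ B.T + 1e-3 * np.eye(d)
    X = X / np.trace(X)                       # a random point with Tr(X) = 1
    assert np.trace(np.linalg.inv(X) @ A) >= opt - 1e-9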

With the above proposition, we have the following corollary which makes the regret bound for AdaGrad way more palatable.

Corollary 6.3.4. Let $\varepsilon > 0$ and $g \in (\mathbb{R}^d)^T$ for some $T \in \mathbb{N}$. Moreover, set $G_T := \varepsilon I + \sum_{t=1}^{T} g_t g_t^{\mathsf T}$ and $\mathcal{S}_d := \{X \in \mathbb{S}^d_{+} : \mathrm{Tr}(X) = 1\}$. Then
$$\mathrm{Tr}(G_T^{1/2}) = \sqrt{\min_{H \in \mathcal{S}_d \cap \mathbb{S}^d_{++}}\Big(\varepsilon\,\mathrm{Tr}(H^{-1}) + \sum_{t=1}^{T}\|g_t\|_{H^{-1}}^2\Big)}.$$

Proof. Set $\mathcal{S}_{++} := \mathcal{S}_d \cap \mathbb{S}^d_{++}$. By Proposition 6.3.3, we have

$$\begin{aligned}
\sqrt{\min_{H \in \mathcal{S}_{++}}\Big(\varepsilon\,\mathrm{Tr}(H^{-1}) + \sum_{t=1}^{T}\|g_t\|_{H^{-1}}^2\Big)} &= \sqrt{\min_{H \in \mathcal{S}_{++}}\Big(\varepsilon\langle I, H^{-1}\rangle + \sum_{t=1}^{T}\langle g_t g_t^{\mathsf T}, H^{-1}\rangle\Big)}\\
&= \sqrt{\min_{H \in \mathcal{S}_{++}}\langle G_T, H^{-1}\rangle} = \sqrt{\mathrm{Tr}(G_T^{1/2})^2} = \mathrm{Tr}(G_T^{1/2}).
\end{aligned}$$
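The identity in Corollary 6.3.4 is easy to check numerically by evaluating the objective at the optimal $H = G_T^{1/2}/\mathrm{Tr}(G_T^{1/2})$ given by Proposition 6.3.3 (illustrative sketch with random data, not part of the thesis):

import numpy as np

rng = np.random.default_rng(4)
d, T, eps = 3, 20, 1e-2

gs = [rng.standard_normal(d) for _ in range(T)]
G = eps * np.eye(d) + sum(np.outer(g, g) for g in gs)

w, Q = np.linalg.eigh(G)
G_sqrt = (Q * np.sqrt(w)) @ Q.T
H = G_sqrt / np.trace(G_sqrt)                 # optimal H from Proposition 6.3.3
H_inv = np.linalg.inv(H)

value = eps * np.trace(H_inv) + sum(float(g @ H_inv @ g) for g in gs)
assert np.isclose(np.trace(G_sqrt), np.sqrt(value))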

In words, the trace in the regret bound of AdaGrad has value close to the square root of the sum of squared norms of the gradients, where the norm is the best among all norms induced by matrices $H^{-1}$ with $H \in \mathbb{S}^d_{++}$ in the spectraplex $\mathcal{S}_d$, that is, such that $\mathrm{Tr}(H) = 1$. Note that, for any $g \in \mathbb{R}^d$, by setting $\bar H := d^{-1}I \in \mathcal{S}_d$ we have
$$\|g\|_{\bar H}^2 = \frac{1}{d}g^{\mathsf T}Ig = \frac{1}{d}\|g\|_2^2 \quad\text{and}\quad \mathrm{Tr}(\bar H^{-1}) = d\,\mathrm{Tr}(I) = d^2. \tag{6.13}$$
Thus, for $\varepsilon > 0$ small enough (namely, smaller than $d^{-2}$), we conclude that the regret bound from Theorem 6.3.2 is as good as the one from Theorem 6.1.1.