
Corollary 6.2.6. Let $C := (X, F)$ be an OCO instance such that $X$ is a nonempty closed set and such that each $f \in F$ is a proper closed function which is subdifferentiable on $X$. Let $\varepsilon > 0$ and let $\Phi\colon \mathbb{S}^d \to (-\infty,+\infty]$ be a meta-regularizer. Let $T \in \mathbb{N}$, let $\mathrm{ENEMY}$ be an enemy oracle for $C$, and define
$$(x, f) := \mathrm{OCO}_{C}(\mathrm{AdaReg}_X^{\Phi}, \mathrm{ENEMY}, T).$$
Moreover, let $H_1, H_{T+1} \in \mathbb{S}^d_{++}$ be as in the definition of $\mathrm{AdaReg}_X^{\Phi}(f)$. Finally, let $G_T \in \mathbb{S}^d_{++}$ be as in the definition of $\mathrm{AdaReg}_X^{\Phi}(f)$ and suppose there is $\theta \in \mathbb{R}_{++}$ such that $\theta \geq \sup\{\,\|x - u\|_2^2 : x, u \in X\,\}$.

Then, for every $u \in X$ and for $x_0 := x_1$,
$$\mathrm{Regret}(\mathrm{AdaReg}_X^{\Phi}, f, u) \leq \frac{\theta}{2}\mathrm{Tr}(H_{T+1}^{-1}) + \frac{1}{2}\big(\langle G_T, H_{T+1}\rangle + \Phi(H_{T+1}) - \Phi(H_1)\big).$$

Proof. Let $u \in X$ and, for every $t \in \{1, \ldots, T+1\}$, let $H_t \in \mathbb{S}^d_{++}$ be as in the definition of $\mathrm{AdaReg}_X^{\Phi}(f)$ and define $D_t := H_t^{-1} - [t > 1]H_{t-1}^{-1}$. By Theorem 6.2.4, we have
$$\mathrm{Regret}(\mathrm{AdaReg}_X^{\Phi}, f, u) \leq \frac{1}{2}\sum_{t=0}^{T}\|u - x_t\|_{D_{t+1}}^2 + \frac{1}{2}\min_{H \in \mathbb{S}^d_{++}}\big(\langle G_T, H\rangle + \Phi(H) - \Phi(H_1)\big).$$
By definition, $H_{T+1}$ attains the above minimum. Thus, it only remains to bound the term $\sum_{t=0}^{T}\|u - x_t\|_{D_{t+1}}^2$. Note that

$$\sum_{t=0}^{T}\|u - x_t\|_{D_{t+1}}^2 = \sum_{t=0}^{T}(u - x_t)^{\mathsf T}D_{t+1}(u - x_t) \overset{\text{Lem.\ 6.2.5}}{\leq} \sum_{t=0}^{T}\|u - x_t\|_2^2\,\mathrm{Tr}(D_{t+1}) \leq \theta\sum_{t=0}^{T}\mathrm{Tr}(D_{t+1}) = \theta\,\mathrm{Tr}\Big(\sum_{t=0}^{T}D_{t+1}\Big) = \theta\,\mathrm{Tr}(H_{T+1}^{-1}).$$
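The only nontrivial step in the chain above is the inequality attributed to Lemma 6.2.5, namely that $v^{\mathsf T}Dv \leq \|v\|_2^2\,\mathrm{Tr}(D)$ for positive semidefinite $D$. The following Python/NumPy snippet (an illustration only, not part of the thesis; it merely probes the inequality on random instances) checks it numerically:

import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    d = int(rng.integers(2, 8))
    B = rng.standard_normal((d, d))
    D = B @ B.T                                   # a random positive semidefinite matrix
    v = rng.standard_normal(d)
    # v^T D v <= ||v||_2^2 * Tr(D), since D is dominated by Tr(D)*I when D is PSD
    assert v @ D @ v <= (v @ v) * np.trace(D) + 1e-9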

translating the intuition that all the experts are equal from the perspective of the player. However, note that with the strategy from Section 6.1 the step size at round $t \in [T]$ is $\Theta(1/\sqrt{t-1})$. That is, even though the amount of information revealed by the enemy about each expert is the same after each interval of $d$ rounds, the weights attributed to the experts at the end of each interval are not uniform: they depend on the order in which the experts were penalized. Intuitively this may seem weird, since the order of appearance should not matter much in a game against this enemy oracle.

The AdaGrad algorithm can be interpreted as trying to make the subgradient steps adaptive in a more nuanced fashion. Instead of only adapting the step size based on the norm of the subgradients, at round $t$ the algorithm skews the subgradient with a matrix $H_t$ built from rank-one updates based on the subgradients of previous rounds, and then performs the subgradient step. Finally, we define a player oracle which formally implements the Adaptive Gradient algorithm in Algorithm 6.3.

Algorithm 6.3 Definition of $\mathrm{AdaGrad}_X\langle f_1, \ldots, f_T\rangle$
Input:
(i) A closed convex set $X \subseteq \mathbb{E}$,
(ii) Convex functions $f_1, \ldots, f_T \in F$ for some $T \in \mathbb{N}$ and $F \subseteq (-\infty,+\infty]^{\mathbb{E}}$ such that $f_t$ is subdifferentiable on $X$ for each $t \in [T]$,
(iii) Positive real numbers $\varepsilon > 0$ and $\eta > 0$ (usually clear from the context).
Output: $x_{T+1} \in X$
  $G_0 \leftarrow \varepsilon I$
  $\{x_1\} \leftarrow \arg\min_{x \in X}\|x\|_2$
  for $t = 1$ to $T$ do
    $\triangleright$ Computations for round $t+1$
    Compute $g_t \in \partial f_t(x_t)$
    $G_t \leftarrow G_{t-1} + g_t g_t^{\mathsf T}$
    $x_{t+1} \leftarrow \Pi_X^{G_t^{1/2}}\big(x_t - \eta G_t^{-1/2} g_t\big)$
  return $x_{T+1}$
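For concreteness, the following Python/NumPy sketch mirrors the loop of Algorithm 6.3 under simplifying assumptions that are not part of the algorithm's statement: $X = \mathbb{R}^d$ (so the skewed projection is the identity) and the losses are accessed only through a user-supplied callable returning a subgradient.

import numpy as np

def matrix_power(M, p):
    """M**p for a symmetric positive definite matrix M, via eigendecomposition."""
    w, Q = np.linalg.eigh(M)
    return (Q * w**p) @ Q.T

def adagrad(subgradient, d, T, eps=1e-3, eta=1.0):
    """Sketch of Algorithm 6.3 with X = R^d, so the skewed projection is the identity.

    `subgradient(t, x)` is assumed to return some g_t in the subdifferential of f_t at x.
    """
    G = eps * np.eye(d)                 # G_0 <- eps * I
    x = np.zeros(d)                     # x_1 = argmin over R^d of ||x||_2
    for t in range(1, T + 1):
        g = subgradient(t, x)           # g_t in the subdifferential of f_t at x_t
        G = G + np.outer(g, g)          # G_t <- G_{t-1} + g_t g_t^T
        x = x - eta * matrix_power(G, -0.5) @ g   # skewed subgradient step
        # for a general X, one would project here onto X w.r.t. the norm induced by G^{1/2}
    return x

# Example use (hypothetical loss): 50 rounds of f_t(x) = |x_1 - 1| over R^2.
x_final = adagrad(lambda t, x: np.array([np.sign(x[0] - 1.0), 0.0]), d=2, T=50)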

As we have discussed, the AdaGrad algorithm can be seen as a generalization of the algorithm from Theorem 6.1.1 with the squared $\ell_2$-norm as the mirror map. In Theorem 6.1.1 the algorithm performs, at a round $t \in \mathbb{N}\setminus\{0\}$, a subgradient step with step size $O\big(\big(\sum_{i=1}^{t-1}\|g_i\|_2^2\big)^{-1/2}\big)$, where $g_1, \ldots, g_{t-1} \in \mathbb{R}^d$ are the subgradients (from the enemy's functions) used by the player oracle on past rounds. The AdaGrad algorithm, on the other hand, performs at round $t \in \mathbb{N}\setminus\{0\}$ a step in the direction of the subgradient skewed by a matrix $G_{t-1}^{-1/2}$, where $G_{t-1} \in \mathbb{S}^d_{++}$ is a matrix built from rank-one updates based on the subgradients $g_1, \ldots, g_{t-1} \in \mathbb{R}^d$ of the enemy's past functions (plus a small multiple of the identity to ensure that $G_{t-1}$ is invertible). Additionally, the projection onto the set $X \subseteq \mathbb{R}^d$ from which the player picks its points is also skewed, by the matrix $G_t^{1/2}$.
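To make this contrast concrete, the toy computation below (illustrative only; the gradient sequence is made up) compares the single scalar step size in the spirit of Theorem 6.1.1 with the direction-dependent scaling $G_T^{-1/2}$: a coordinate that is rarely penalized receives a much larger effective step.

import numpy as np

# Hypothetical gradient sequence: the first coordinate is penalized often,
# the second only once and mildly.
gs = [np.array([1.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 0.1])]

eps = 1e-8
G = eps * np.eye(2) + sum(np.outer(g, g) for g in gs)

# Scalar step size in the spirit of Theorem 6.1.1: one scale for all coordinates.
scalar_step = 1.0 / np.sqrt(sum(float(g @ g) for g in gs))

# AdaGrad's skewing matrix G_T^{-1/2}: a different scale per direction.
w, Q = np.linalg.eigh(G)
G_inv_sqrt = (Q * w**-0.5) @ Q.T

print(scalar_step)            # ~0.705
print(np.diag(G_inv_sqrt))    # ~[0.71, 10.0]: the rarely penalized direction moves more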

To derive a regret bound for the above algorithm, we will show that the AdaGrad algorithm is equivalent to the AdaReg algorithm with a special (and simple) choice of meta-regularizer. An instructive way to define our meta-regularizer $\Phi$ is to define its behavior on the eigenvalues of the matrices$^{10}$ given to $\Phi$ as input, and then to see which known matrix operation this yields (if any).

In this way, the analysis of the algorithm is greatly simplified since, for convex functions applied to eigenvalues, we have tools to compute subgradients and, thus, check optimality conditions, as we have seen in Section 3.7.

$^{10}$Namely, given a matrix $X \in \mathbb{S}^d$ we will apply a symmetric convex function $f\colon \mathbb{R}^d \to (-\infty,+\infty]$ on $\lambda(X)$ and then build a new matrix $X' \in \mathbb{S}^d$ from $f(\lambda(X))$. For details, see Section 3.7.

In the next lemma we look at the form and at some properties of the meta-regularizer which yields AdaGrad.

Lemma 6.3.1. Let $\eta > 0$, define $f\colon \mathbb{R}^d \to (-\infty,+\infty]$ by
$$f(x) := \eta^2\sum_{i=1}^{d}[x_i > 0]\frac{1}{x_i} + \delta(x \mid \mathbb{R}^d_{++}), \qquad \forall x \in \mathbb{R}^d,$$
and set $\Phi := f^{\mathbb{S}}$. Then $f$ is a proper closed convex function,
$$\Phi(H) = \eta^2\,\mathrm{Tr}(H^{-1}) \quad\text{and}\quad \nabla\Phi(H) = -\eta^2 H^{-2} \qquad \forall H \in \mathbb{S}^d_{++}. \tag{6.11}$$
Additionally, for every $G \in \mathbb{S}^d_{++}$ the infimum $\inf_{H \in \mathbb{S}^d_{++}}(\langle G, H\rangle + \Phi(H))$ is attained by $\eta G^{-1/2}$. Moreover, $\Phi$ is a meta-regularizer and, for every $\varepsilon > 0$ and for a certain well-order over the sets used by the oracles $\mathrm{AdaGrad}_X$ and $\mathrm{AdaReg}_X^{\Phi}$, we have $\mathrm{AdaGrad}_X = \mathrm{AdaReg}_X^{\Phi}$ for every nonempty closed and convex set $X \subseteq \mathbb{R}^d$.

Proof. First, let us verify that $f$ is a proper closed convex function. First of all, it is clear that $f$ is proper. Define $\varphi(\alpha) := [\alpha > 0]\alpha^{-1} + \delta(\alpha \mid \mathbb{R}_{++})$. Since $\varphi''(\alpha) = 2\alpha^{-3} > 0$ for every $\alpha \in \mathbb{R}_{++}$, by Lemma 3.1.1 we conclude that $\varphi$ is convex. Since $f(x) = \eta^2\sum_{i=1}^{d}\varphi(x_i)$ for every $x \in \mathbb{R}^d$, we conclude that $f$ is convex. Finally, since $\lim_{\alpha \to 0}\varphi(\alpha) = +\infty$ and $\varphi$ is positive throughout $\mathbb{R}$, we conclude that $\liminf_{x \to \bar x} f(x) = +\infty = f(\bar x)$ for any $\bar x \in \mathbb{R}^d_{+}\setminus\mathbb{R}^d_{++}$. Therefore, $f$ is closed.

Let $H \in \mathbb{S}^d_{++}$, set $\lambda := \lambda(H)$, and set $\Lambda := \mathrm{Diag}(\lambda)$. Let us first show that (6.11) holds. By the Spectral Decomposition Theorem (Theorem 1.1.1), there is an orthogonal matrix $Q \in \mathbb{R}^{d \times d}$ such that $H = Q\Lambda Q^{\mathsf T}$. Since $H \succ 0$, by Theorem 1.1.3 we know that $\lambda > 0$. Hence, $\Lambda$ is invertible with $(\Lambda^{-1})_{i,j} = [i = j]\lambda_i^{-1}$ for every $i, j \in [d]$. Hence,
$$H Q\Lambda^{-1}Q^{\mathsf T} = Q\Lambda Q^{\mathsf T} Q\Lambda^{-1}Q^{\mathsf T} = Q\Lambda\Lambda^{-1}Q^{\mathsf T} = QQ^{\mathsf T} = I,$$
that is, $Q\Lambda^{-1}Q^{\mathsf T} = H^{-1}$. Finally, we have

$$\Phi(H) = f(\lambda) = \eta^2\sum_{i=1}^{d}\lambda_i^{-1} = \eta^2\,\mathrm{Tr}(\Lambda^{-1}) = \eta^2\,\mathrm{Tr}(\Lambda^{-1}Q^{\mathsf T}Q) = \eta^2\,\mathrm{Tr}(Q\Lambda^{-1}Q^{\mathsf T}) = \eta^2\,\mathrm{Tr}(H^{-1}).$$
Moreover, note that $\nabla f(\lambda)_i = -\eta^2\lambda_i^{-2}$ for every $i \in [d]$. Hence, $\mathrm{Diag}(\nabla f(\lambda)) = -\eta^2\Lambda^{-2}$, and by Corollary 3.7.5 we have
$$\nabla\Phi(H) = Q\,\mathrm{Diag}(\nabla f(\lambda))\,Q^{\mathsf T} = -\eta^2 Q\Lambda^{-2}Q^{\mathsf T} = -\eta^2\big(Q\Lambda^{-1}Q^{\mathsf T}\big)^2 = -\eta^2 H^{-2}.$$
This proves (6.11). Let $G \in \mathbb{S}^d_{++}$. Let us now show that

$$\{\eta G^{-1/2}\} = \arg\min_{H \in \mathbb{S}^d_{++}}\big(\langle G, H\rangle + \Phi(H)\big). \tag{6.12}$$

Let $\hat H \in \mathbb{S}^d_{++}$. Since $\operatorname{dom} f = \mathbb{R}^d_{++}$, we have $\operatorname{ri}(\operatorname{dom}\Phi) = \mathbb{S}^d_{++}$, and since $\mathbb{S}^d_{++}$ is open, $\hat H$ lies in its interior and thus $N_{\mathbb{S}^d_{++}}(\hat H) = \{0\}$. Therefore, $\mathbb{S}^d_{++} \cap \operatorname{ri}(\operatorname{dom}\Phi)$ is nonempty, and by Theorem 3.6.2 we have
$$\hat H \in \arg\min_{H \in \mathbb{S}^d_{++}}\big(\langle G, H\rangle + \Phi(H)\big) \iff G + \nabla\Phi(\hat H) = 0 \overset{(6.11)}{\iff} \eta^2\hat H^{-2} = G \iff (\hat H^{-1})^2 = \frac{1}{\eta^2}G \overset{\text{Prop.\ 1.1.4}}{\iff} \hat H^{-1} = \frac{1}{\eta}G^{1/2} \iff \hat H = \eta G^{-1/2}.$$
This finishes the proof of (6.12).

Now let us show that $\Phi$ is a meta-regularizer. Let $T \in \mathbb{N}$ and $g \in (\mathbb{R}^d)^T$. Moreover, let $\varepsilon > 0$ and set $G_{T-1} := \varepsilon I + \sum_{t=1}^{T-1} g_t g_t^{\mathsf T}$ and $G_T := G_{T-1} + g_T g_T^{\mathsf T}$. Condition (6.3.i) of a meta-regularizer is satisfied by $\Phi$ since, by (6.12), we know that $\inf_{H \in \mathbb{S}^d_{++}}(\langle H, G_T\rangle + \Phi(H))$ is attained by $\eta G_T^{-1/2}$. Thus, set $H_{T+1} := \eta G_T^{-1/2}$ and $H_T := \eta G_{T-1}^{-1/2}$. Note that
$$H_{T+1}^{-1} - H_T^{-1} = \frac{1}{\eta}\big(G_T^{1/2} - G_{T-1}^{1/2}\big).$$
Since $\eta > 0$ and $G_T - G_{T-1} = g_T g_T^{\mathsf T} \succeq 0$, by Lemma 1.1.5 we have that $\frac{1}{\eta}(G_T^{1/2} - G_{T-1}^{1/2}) \succeq 0$. That is, $\Phi$ satisfies condition (6.3.ii), which completes the proof that $\Phi$ is a meta-regularizer.

Last but not least, let us show that $\mathrm{AdaGrad}_X = \mathrm{AdaReg}_X^{\Phi}$ for any nonempty closed and convex set $X \subseteq \mathbb{R}^d$ and any $\varepsilon > 0$ (recall that we already have $\eta > 0$ from the statement of the lemma). Let $X \subseteq \mathbb{R}^d$ be a nonempty closed and convex set and let $\varepsilon > 0$. Moreover, let $f := \langle f_1, \ldots, f_T\rangle \in \mathrm{Seq}\big((-\infty,+\infty]^{\mathbb{R}^d}\big)$ be such that $f_t$ is subdifferentiable on $X$ for every $t \in [T]$. Let us show that $\mathrm{AdaGrad}_X(f_{1:t-1}) = \mathrm{AdaReg}_X^{\Phi}(f_{1:t-1})$ by induction on $t \in [T+1]$. Set $x_1 := \mathrm{AdaReg}_X^{\Phi}(\langle\rangle)$ and let $H_1 \in \mathbb{S}^d_{++}$ be as in the definition of $\mathrm{AdaReg}_X^{\Phi}(\langle\rangle)$. By (6.12), we know that $H_1 = (\eta/\sqrt{\varepsilon})I$. Thus,
$$x_1 \in \arg\min_{x \in X}\|x\|_{H_1^{-1}} = \arg\min_{x \in X} x^{\mathsf T}H_1^{-1}x = \arg\min_{x \in X}\frac{\sqrt{\varepsilon}}{\eta}\,x^{\mathsf T}x = \arg\min_{x \in X}\|x\|_2^2 = \arg\min_{x \in X}\|x\|_2.$$
Since the squared $\ell_2$-norm is strongly convex (by Lemma 3.9.5), $x_1$ is the unique point that attains the above minima. Thus, $x_1 = \mathrm{AdaGrad}_X(\langle\rangle)$. Let $t \in \{2, \ldots, T+1\}$, and let $g_{t-1} \in \mathbb{R}^d$ and $G_{t-1} \in \mathbb{S}^d_{++}$ be as in the definition of $x_t := \mathrm{AdaReg}_X^{\Phi}(f_{1:t-1})$ (which are equal to $g_{t-1}$ and $G_{t-1}$ in the definition of $\mathrm{AdaGrad}_X(f_{1:t-1})$ with a proper choice of well-order on the subdifferentials used). Finally, let $H_t \in \mathbb{S}^d_{++}$ be as in the definition of $\mathrm{AdaReg}_X^{\Phi}(f_{1:t-1})$ and set $x_{t-1} := \mathrm{AdaReg}_X^{\Phi}(f_{1:t-2}) = \mathrm{AdaGrad}_X(f_{1:t-2})$. Then,
$$x_t = \Pi_X^{H_t^{-1}}\big(x_{t-1} - H_t g_{t-1}\big) \overset{(6.12)}{=} \Pi_X^{G_{t-1}^{1/2}}\big(x_{t-1} - \eta G_{t-1}^{-1/2} g_{t-1}\big) = \mathrm{AdaGrad}_X(f_{1:t-1}).$$
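As a quick numerical sanity check of (6.11) and (6.12) (a sketch, not part of the proof; it only compares the candidate minimizer against randomly sampled positive definite matrices), one can verify that $\eta G^{-1/2}$ achieves a value of $\langle G, H\rangle + \eta^2\mathrm{Tr}(H^{-1})$ no larger than that of random points of $\mathbb{S}^d_{++}$:

import numpy as np

rng = np.random.default_rng(1)
d, eta = 4, 0.7

def mpow(M, p):
    w, Q = np.linalg.eigh(M)
    return (Q * w**p) @ Q.T

A = rng.standard_normal((d, d))
G = A @ A.T + np.eye(d)                      # a generic element of S^d_{++}

def objective(H):
    # <G, H> + Phi(H) with Phi(H) = eta^2 * Tr(H^{-1})
    return np.trace(G @ H) + eta**2 * np.trace(np.linalg.inv(H))

best = objective(eta * mpow(G, -0.5))        # candidate minimizer from (6.12)
for _ in range(1000):
    B = rng.standard_normal((d, d))
    H = B @ B.T + 1e-3 * np.eye(d)           # a random positive definite matrix
    assert objective(H) >= best - 1e-9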

Now that we know which meta-regularizer to use to write AdaGrad as AdaReg, we can apply the results from Section 6.2 to obtain regret bounds for AdaGrad.

Theorem 6.3.2. Let $C := (X, F)$ be an OCO instance such that $X \subseteq \mathbb{R}^d$ is a nonempty closed set and such that each $f \in F$ is a proper closed function which is subdifferentiable on $X$. Let$^{11}$ $\varepsilon > 0$, let $T \in \mathbb{N}$, let $\mathrm{ENEMY}$ be an enemy oracle for $C$, and define
$$(x, f) := \mathrm{OCO}_{C}(\mathrm{AdaGrad}_X, \mathrm{ENEMY}, T).$$
Moreover, let $G_T \in \mathbb{S}^d_{++}$ be as in the definition of $\mathrm{AdaGrad}_X(f)$, suppose there is $\theta \in \mathbb{R}_{++}$ such that $\theta \geq \sup\{\,\|u - x\|_2^2 : u, x \in X\,\}$, and set $\eta := \sqrt{\theta/2}$. Then,
$$\mathrm{Regret}(\mathrm{AdaGrad}_X, f, X) \leq \sqrt{2\theta}\,\mathrm{Tr}(G_T^{1/2}).$$

$^{11}$This $\varepsilon$ is needed to define AdaGrad, although it does not appear in the regret bound.

Proof. Define $f\colon \mathbb{R}^d \to (-\infty,+\infty]$ by $f(x) := \eta^2\sum_{i=1}^{d}[x_i \neq 0]\frac{1}{x_i}$ for each $x \in \mathbb{R}^d$, and set $\Phi := f^{\mathbb{S}}$. By Lemma 6.3.1, we have $\Phi(H) = \eta^2\,\mathrm{Tr}(H^{-1})$ for any $H \in \mathbb{S}^d_{++}$ and $\mathrm{AdaReg}_X^{\Phi} = \mathrm{AdaGrad}_X$. Thus, we only need to bound the regret of $\mathrm{AdaReg}_X^{\Phi}$. Let $H_1, H_{T+1} \in \mathbb{S}^d_{++}$ be as in the definition of $\mathrm{AdaReg}_X^{\Phi}(f)$. By Lemma 6.3.1 together with the definitions of $H_1$ and $H_{T+1}$ we have
$$H_1 = \eta(\varepsilon I)^{-1/2} = \frac{\eta}{\sqrt{\varepsilon}}I \quad\text{and}\quad H_{T+1} = \eta G_T^{-1/2}.$$
Thus, by Corollary 6.2.6 we have, for every $u \in X$,
$$\begin{aligned}
\mathrm{Regret}(\mathrm{AdaReg}_X^{\Phi}, f, u) &\leq \frac{\theta}{2}\mathrm{Tr}(H_{T+1}^{-1}) + \frac{1}{2}\big(\langle G_T, H_{T+1}\rangle + \Phi(H_{T+1}) - \Phi(H_1)\big)\\
&= \frac{\theta}{2\eta}\mathrm{Tr}(G_T^{1/2}) + \frac{1}{2}\Big(\eta\,\mathrm{Tr}(G_T^{1/2}) + \Phi(\eta G_T^{-1/2}) - \Phi\big(\tfrac{\eta}{\sqrt{\varepsilon}}I\big)\Big)\\
&= \frac{\theta}{2\eta}\mathrm{Tr}(G_T^{1/2}) + \frac{1}{2}\Big(\eta\,\mathrm{Tr}(G_T^{1/2}) + \eta\,\mathrm{Tr}(G_T^{1/2}) - \eta\sqrt{\varepsilon}\,\mathrm{Tr}(I)\Big)\\
&\leq \frac{\theta}{2\eta}\mathrm{Tr}(G_T^{1/2}) + \eta\,\mathrm{Tr}(G_T^{1/2})\\
&= \sqrt{\tfrac{\theta}{2}}\,\mathrm{Tr}(G_T^{1/2}) + \sqrt{\tfrac{\theta}{2}}\,\mathrm{Tr}(G_T^{1/2}) = \sqrt{2\theta}\,\mathrm{Tr}(G_T^{1/2}).
\end{aligned}$$
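The chain of (in)equalities in the proof can also be checked numerically. The sketch below (an illustration with random data, not part of the thesis) evaluates the right-hand side of Corollary 6.2.6 at $H_1 = (\eta/\sqrt{\varepsilon})I$ and $H_{T+1} = \eta G_T^{-1/2}$ with $\eta = \sqrt{\theta/2}$ and confirms it is at most $\sqrt{2\theta}\,\mathrm{Tr}(G_T^{1/2})$:

import numpy as np

rng = np.random.default_rng(2)
d, T, theta, eps = 4, 30, 2.0, 1e-3
eta = np.sqrt(theta / 2)

gs = [rng.standard_normal(d) for _ in range(T)]
G = eps * np.eye(d) + sum(np.outer(g, g) for g in gs)
w, Q = np.linalg.eigh(G)
G_sqrt = (Q * np.sqrt(w)) @ Q.T
G_inv_sqrt = (Q * w**-0.5) @ Q.T

def Phi(H):
    return eta**2 * np.trace(np.linalg.inv(H))

H_1 = (eta / np.sqrt(eps)) * np.eye(d)        # H_1 from Lemma 6.3.1
H_T1 = eta * G_inv_sqrt                       # H_{T+1}

rhs = (theta / 2) * np.trace(np.linalg.inv(H_T1)) \
    + 0.5 * (np.trace(G @ H_T1) + Phi(H_T1) - Phi(H_1))
assert rhs <= np.sqrt(2 * theta) * np.trace(G_sqrt) + 1e-9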

One may have noticed that the value of $\varepsilon$ is free to be as small as we want in the above result. This suggests that this parameter may not be needed after all. However, if $\varepsilon = 0$, then the matrices $G_t$ in the definition of AdaGrad are not necessarily invertible anymore. To solve this, one could use the Moore–Penrose pseudo-inverse instead of the inverse of the matrices. For the sake of brevity, we have chosen to describe only the case where all matrices are invertible (that is, the case $\varepsilon > 0$).

One problem with the regret bound from Theorem 6.3.2 is that it is hard to interpret how good it is. The intuitive meaning of $\mathrm{Tr}(G_T^{1/2})$ is not clear, where $G_T \in \mathbb{S}^d_{++}$ is defined as in Theorem 6.3.2. We know that $\mathrm{Tr}(G_T)$ is the sum of the squared $\ell_2$-norms of the subgradients (plus $\varepsilon d$, where $d \in \mathbb{N}$ is the dimension of the problem), but interpreting $\mathrm{Tr}(G_T^{1/2})$ is considerably harder. The next proposition sheds some light on the meaning of the above regret bound and shows how it may be as good as the one from Section 6.1, for example.

Proposition 6.3.3 ([31, Lemma 15]). Let $A \in \mathbb{S}^d_{++}$. Then $\inf\{\mathrm{Tr}(X^{-1}A) : X \in \mathbb{S}^d_{++},\ \mathrm{Tr}(X) = 1\}$ is attained by $A^{1/2}/\mathrm{Tr}(A^{1/2})$.
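A quick numerical illustration of Proposition 6.3.3 (a sketch only; it probes the feasible set with random trace-normalized positive definite matrices and checks the claimed optimal value $\mathrm{Tr}(A^{1/2})^2$):

import numpy as np

rng = np.random.default_rng(3)
d = 5

def mpow(M, p):
    w, Q = np.linalg.eigh(M)
    return (Q * w**p) @ Q.T

A = rng.standard_normal((d, d))
A = A @ A.T + np.eye(d)                       # A in S^d_{++}

A_sqrt = mpow(A, 0.5)
X_star = A_sqrt / np.trace(A_sqrt)            # claimed minimizer
opt = np.trace(np.linalg.inv(X_star) @ A)     # equals Tr(A^{1/2})^2
assert np.isclose(opt, np.trace(A_sqrt) ** 2)

for _ in range(1000):
    B = rng.standard_normal((d, d))
    X = B @ B.T + 1e-3 * np.eye(d)
    X = X / np.trace(X)                       # a random point with Tr(X) = 1
    assert np.trace(np.linalg.inv(X) @ A) >= opt - 1e-9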

With the above proposition, we have the following corollary which makes the regret bound for AdaGrad way more palatable.

Corollary 6.3.4. Let $\varepsilon > 0$ and $g \in (\mathbb{R}^d)^T$ for some $T \in \mathbb{N}$. Moreover, set $G_T := \varepsilon I + \sum_{t=1}^{T} g_t g_t^{\mathsf T}$ and $\mathcal{S}_d := \{X \in \mathbb{S}^d_{+} : \mathrm{Tr}(X) = 1\}$. Then
$$\mathrm{Tr}(G_T^{1/2}) = \sqrt{\min_{H \in \mathcal{S}_d \cap \mathbb{S}^d_{++}}\Big(\varepsilon\,\mathrm{Tr}(H^{-1}) + \sum_{t=1}^{T}\|g_t\|_{H^{-1}}^2\Big)}.$$

Proof. Set $\mathcal{S}_{++} := \mathcal{S}_d \cap \mathbb{S}^d_{++}$. By Proposition 6.3.3, we have

$$\begin{aligned}
\sqrt{\min_{H \in \mathcal{S}_{++}}\Big(\varepsilon\,\mathrm{Tr}(H^{-1}) + \sum_{t=1}^{T}\|g_t\|_{H^{-1}}^2\Big)} &= \sqrt{\min_{H \in \mathcal{S}_{++}}\Big(\varepsilon\langle I, H^{-1}\rangle + \sum_{t=1}^{T}\langle g_t g_t^{\mathsf T}, H^{-1}\rangle\Big)}\\
&= \sqrt{\min_{H \in \mathcal{S}_{++}}\langle G_T, H^{-1}\rangle} = \sqrt{\mathrm{Tr}(G_T^{1/2})^2} = \mathrm{Tr}(G_T^{1/2}).
\end{aligned}$$
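The identity in Corollary 6.3.4 is easy to check numerically by evaluating the objective at the optimal $H = G_T^{1/2}/\mathrm{Tr}(G_T^{1/2})$ given by Proposition 6.3.3 (illustrative sketch with random data, not part of the thesis):

import numpy as np

rng = np.random.default_rng(4)
d, T, eps = 3, 20, 1e-2

gs = [rng.standard_normal(d) for _ in range(T)]
G = eps * np.eye(d) + sum(np.outer(g, g) for g in gs)

w, Q = np.linalg.eigh(G)
G_sqrt = (Q * np.sqrt(w)) @ Q.T
H = G_sqrt / np.trace(G_sqrt)                 # optimal H from Proposition 6.3.3
H_inv = np.linalg.inv(H)

value = eps * np.trace(H_inv) + sum(float(g @ H_inv @ g) for g in gs)
assert np.isclose(np.trace(G_sqrt), np.sqrt(value))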

In words, the trace in the regret bound of AdaGrad has value close to the square root of the sum of squared norms of the gradients, where the norm is the best among all norms induced by matrices $H^{-1}$ with $H \in \mathbb{S}^d_{++}$ in the spectraplex $\mathcal{S}_d$, that is, such that $\mathrm{Tr}(H) = 1$. Note that, for any $g \in \mathbb{R}^d$, by setting $\bar H := d^{-1}I \in \mathcal{S}_d$ we have
$$\|g\|_{\bar H}^2 = \frac{1}{d}g^{\mathsf T}Ig = \frac{1}{d}\|g\|_2^2 \quad\text{and}\quad \mathrm{Tr}(\bar H^{-1}) = d\,\mathrm{Tr}(I) = d^2. \tag{6.13}$$
Thus, for $\varepsilon > 0$ small enough (namely, smaller than $d^{-2}$), we conclude that the regret bound from Theorem 6.3.2 is as good as the one from Theorem 6.1.1.