
That is, the trace in the regret bound of AdaGrad is close to the square root of the sum of the squared norms of the gradients, where the norm is the best among all norms induced by matrices $H^{-1}$ with $H \in \mathbb{S}^d_{++}$ in the spectraplex $\mathcal{S}_d$, that is, such that $\operatorname{Tr}(H) = 1$. Note that, for any $g \in \mathbb{R}^d$, by setting $\bar{H} := d^{-1}I \in \mathcal{S}_d$ we have
\[
  \lVert g\rVert_{\bar{H}}^2 = \tfrac{1}{d}\, g^{\mathsf{T}} I g = \tfrac{1}{d}\lVert g\rVert_2^2
  \qquad\text{and}\qquad
  \operatorname{Tr}(\bar{H}^{-1}) = d\operatorname{Tr}(I) = d^2. \tag{6.13}
\]
Thus, for $\varepsilon > 0$ small enough (namely, smaller than $d^{-2}$), we conclude that the regret bound from Theorem 6.3.2 is as good as the one from Theorem 6.1.1.
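A quick numerical sanity check of (6.13), written as a sketch under the assumption that $\lVert g\rVert_A^2 = g^{\mathsf{T}} A g$ (the convention used throughout); the dimension and random seed below are arbitrary.

```python
import numpy as np

# Check (6.13): with H_bar := I/d (in the spectraplex, since Tr(I/d) = 1),
# the induced squared norm of g is ||g||_2^2 / d and Tr(H_bar^{-1}) = d^2.
d = 5
rng = np.random.default_rng(0)
g = rng.standard_normal(d)

H_bar = np.eye(d) / d
assert np.isclose(g @ H_bar @ g, (g @ g) / d)
assert np.isclose(np.trace(np.linalg.inv(H_bar)), d ** 2)
```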

In this section, we follow a path analogous to the one from the previous section to derive the Diagonal Adaptive Gradient (DiagAdaGrad) algorithm. Not only that, this allows us to use many of the results from the previous section as stepping stones to derive the results of this section.

Throughout the remainder of this section, we denote by $\mathbb{D}^d := \{\operatorname{Diag}(x) \in \mathbb{S}^d : x \in \mathbb{R}^d\}$ the set of $d \times d$ diagonal matrices. Moreover, we overload the $\operatorname{Diag}$ operator and, for every $A \in \mathbb{S}^d$, define $\operatorname{Diag}(A) := \operatorname{Diag}(\operatorname{diag}(A))$. That is, $\operatorname{Diag}(A)$ is equal to the matrix $A$ but with zeroes on its off-diagonal entries. Let us show the form and some properties of the meta-regularizer we will use to derive the DiagAdaGrad algorithm, proving first a simple lemma about the normal cone and the relative interior of $\mathbb{D}^d$.

Lemma 6.4.1. We have $\operatorname{ri}(\mathbb{D}^d) = \mathbb{D}^d$ and
\[
  N_{\mathbb{D}^d}(\tilde{A}) = \{\,A \in \mathbb{S}^d : \operatorname{diag}(A) = 0\,\}, \qquad \forall \tilde{A} \in \mathbb{D}^d.
\]

Proof. First, note that for any $\tilde{A}, \tilde{B} \in \mathbb{D}^d$ we have $(1-\mu)\tilde{A} + \mu\tilde{B} \in \mathbb{D}^d$ for any $\mu \in \mathbb{R}$. Thus, $\mathbb{D}^d$ is an affine set and $\operatorname{ri}(\mathbb{D}^d) = \mathbb{D}^d$.

Let $\tilde{A} \in \mathbb{D}^d$, let $A \in \mathbb{S}^d$, and define $\bar{a} := \operatorname{diag}(\tilde{A})$. Note that $\langle A, X - \tilde{A}\rangle \leq 0$ for every $X \in \mathbb{D}^d$ if and only if $\langle A, \operatorname{Diag}(x - \bar{a})\rangle \leq 0$ for every $x \in \mathbb{R}^d$. Moreover, $\langle A, \operatorname{Diag}(x - \bar{a})\rangle \leq 0$ for every $x \in \mathbb{R}^d$ if and only if $\operatorname{diag}(A)^{\mathsf{T}}(x - \bar{a}) \leq 0$ for every $x \in \mathbb{R}^d$. That is, $A \in N_{\mathbb{D}^d}(\tilde{A})$ if and only if $\operatorname{diag}(A) \in N_{\mathbb{R}^d}(\bar{a}) = \{0\}$.

Lemma 6.4.2. Let $\eta > 0$, define $f \colon \mathbb{R}^d \to (-\infty,+\infty]$ by
\[
  f(x) := \eta^2 \sum_{i=1}^{d} [x_i \neq 0]\,\frac{1}{x_i}, \qquad \forall x \in \mathbb{R}^d,
\]
and set $\Phi := f^{\mathbb{S}} + \delta(\cdot \mid \mathbb{D}^d)$. Then,
\[
  \Phi(H) = \eta^2 \operatorname{Tr}(H^{-1}) + \delta(H \mid \mathbb{D}^d), \qquad \forall H \in \mathbb{S}^d_{++}, \tag{6.14}
\]
and for every $G \in \mathbb{S}^d_{++}$ the infimum $\inf_{H \in \mathbb{S}^d_{++}}(\langle G, H\rangle + \Phi(H))$ is attained by $\eta(\operatorname{Diag}(G))^{-1/2}$. In particular, for every $\varepsilon > 0$ we have $\mathrm{DiagAdaGrad}_X = \mathrm{AdaReg}_X^{\Phi}$ for every nonempty and closed set $X \subseteq \mathbb{R}^d$.

Proof. Note that (6.14) follows directly from Lemma 6.3.1. Let $G \in \mathbb{S}^d_{++}$. Let us show that
\[
  \{\eta(\operatorname{Diag}(G))^{-1/2}\} = \operatorname*{arg\,min}_{H \in \mathbb{S}^d_{++}} \bigl(\langle G, H\rangle + \Phi(H)\bigr). \tag{6.15}
\]
Since $\Phi$ is proper and infinite outside of $\mathbb{D}^d$, it is clear that the infimum can only be attained by a matrix in $\mathbb{D}^d \cap \mathbb{S}^d_{++}$. Let $\bar{H} \in \mathbb{D}^d \cap \mathbb{S}^d_{++}$ and define $\Phi_{\mathrm{AdaGrad}}(H) := \eta^2 \operatorname{Tr}(H^{-1})$ for every $H \in \mathbb{S}^d_{++}$, so that $\Phi = \Phi_{\mathrm{AdaGrad}} + \delta(\cdot \mid \mathbb{D}^d)$ on $\mathbb{S}^d_{++}$. By Lemma 6.4.1 we have $\operatorname{ri}(\mathbb{D}^d) = \mathbb{D}^d$, and thus $(\operatorname{ri}\mathbb{S}^d_{++}) \cap (\operatorname{ri}(\mathbb{D}^d)) = \mathbb{S}^d_{++} \cap \mathbb{D}^d \neq \emptyset$. Hence, by the formula for the subdifferential of the sum of convex functions from Theorem 3.5.4 together with the differentiability of $\Phi_{\mathrm{AdaGrad}}$ from Lemma 6.3.1,
\[
  \partial\Phi(\bar{H}) = \nabla\Phi_{\mathrm{AdaGrad}}(\bar{H}) + N_{\mathbb{D}^d}(\bar{H}) = -\eta^2 \bar{H}^{-2} + N_{\mathbb{D}^d}(\bar{H}).
\]

Thus, by the optimality conditions from Theorem 3.6.2, $\bar{H}$ attains the infimum in (6.15) if and only if
\[
  -\bigl(G + \partial\Phi(\bar{H})\bigr) \cap N_{\mathbb{S}^d_{++}}(\bar{H}) \neq \emptyset
  \iff
  0 \in G + \partial\Phi(\bar{H}) = G - \eta^2 \bar{H}^{-2} + N_{\mathbb{D}^d}(\bar{H}).
\]

The above holds if and only if there is $A \in N_{\mathbb{D}^d}(\bar{H})$ such that $\eta^2 \bar{H}^{-2} = G + A$. Since $\bar{H} \in \mathbb{D}^d$, we have $\eta^2 \bar{H}^{-2} \in \mathbb{D}^d$ and, therefore, $G + A \in \mathbb{D}^d$. Hence, we have $G + A = \operatorname{Diag}(G + A)$, and $\bar{H}$ attains the infimum in (6.15) if and only if
\[
  \bar{H}^{-2} = \frac{1}{\eta^2}(G + A) = \frac{1}{\eta^2}\operatorname{Diag}(G + A) = \frac{1}{\eta^2}\operatorname{Diag}(G),
\]
where in the last equation we have used that $\operatorname{diag}(A) = 0$ by Lemma 6.4.1. Thus, $\bar{H}$ attains the infimum from (6.15) if and only if $\bar{H} = \eta(\operatorname{Diag}(G))^{-1/2}$, which proves (6.15).
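As a quick sanity check of (6.15), the sketch below (with arbitrary data) compares the claimed minimizer against random diagonal positive definite matrices; since $\Phi$ is infinite off $\mathbb{D}^d$, it suffices to compare over diagonals $h$, for which $\langle G, H\rangle + \eta^2\operatorname{Tr}(H^{-1}) = \sum_i (G_{ii} h_i + \eta^2/h_i)$.

```python
import numpy as np

# Coordinate-wise, G_ii * h_i + eta^2 / h_i is minimized at h_i = eta / sqrt(G_ii),
# i.e. at the diagonal of eta * Diag(G)^{-1/2}; random perturbations never do better.
rng = np.random.default_rng(1)
d, eta = 4, 0.7
A = rng.standard_normal((d, d))
G = A @ A.T + np.eye(d)                      # a generic positive definite G

def objective(h):                            # h = diagonal of H in D^d cap S^d_{++}
    return np.diag(G) @ h + eta ** 2 * np.sum(1.0 / h)

h_star = eta / np.sqrt(np.diag(G))
for _ in range(100):
    h = h_star * np.exp(0.1 * rng.standard_normal(d))
    assert objective(h_star) <= objective(h) + 1e-12
```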

Now let us show that $\Phi$ is a meta-regularizer. Let $T \in \mathbb{N}\setminus\{0\}$ and $g \in (\mathbb{R}^d)^T$. Moreover, let $\varepsilon > 0$ and set $G_{T-1} := \varepsilon I + \sum_{t=1}^{T-1} g_t g_t^{\mathsf{T}}$ and $G_T := G_{T-1} + g_T g_T^{\mathsf{T}}$. Condition (6.3.i) is satisfied by $\Phi$ since, by (6.15), we know that $\inf_{H \in \mathbb{S}^d_{++}}(\langle H, G_T\rangle + \Phi(H))$ is attained by $\eta\operatorname{Diag}(G_T)^{-1/2}$. Set $H_{T+1} := \eta\operatorname{Diag}(G_T)^{-1/2}$ and $H_T := \eta\operatorname{Diag}(G_{T-1})^{-1/2}$. By definition,
\[
  H_{T+1}^{-1} - H_T^{-1} = \frac{1}{\eta}\bigl(\operatorname{Diag}(G_T)^{1/2} - \operatorname{Diag}(G_{T-1})^{1/2}\bigr).
\]
Since $G_T$ and $G_{T-1}$ are positive semidefinite, $\operatorname{diag}(G_T)$ and $\operatorname{diag}(G_{T-1})$ are nonnegative. Moreover, since $G_T - G_{T-1} = g_T g_T^{\mathsf{T}} \succeq 0$, for every $i \in [d]$ we have
\[
  (G_T)_{i,i} = e_i^{\mathsf{T}} G_T e_i \geq e_i^{\mathsf{T}} G_{T-1} e_i = (G_{T-1})_{i,i},
  \quad\text{and hence}\quad
  (G_T)_{i,i}^{1/2} \geq (G_{T-1})_{i,i}^{1/2}.
\]
Since $\eta > 0$, we conclude that $\tfrac{1}{\eta}\operatorname{diag}(G_T)^{1/2} \geq \tfrac{1}{\eta}\operatorname{diag}(G_{T-1})^{1/2}$ entrywise, which proves that $\tfrac{1}{\eta}(\operatorname{Diag}(G_T)^{1/2} - \operatorname{Diag}(G_{T-1})^{1/2})$ is positive semidefinite, that is, $\Phi$ satisfies condition (6.3.ii). This finishes the proof that $\Phi$ is a meta-regularizer.
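A small numerical illustration of condition (6.3.ii) (a sketch with arbitrary data): adding the rank-one term $g_T g_T^{\mathsf{T}}$ can only increase the diagonal of $G$, so the difference of the diagonal square roots is entrywise nonnegative.

```python
import numpy as np

# Diag(G_T)^{1/2} - Diag(G_{T-1})^{1/2} has nonnegative diagonal, hence is PSD.
rng = np.random.default_rng(7)
d, eps = 5, 1e-2
G_prev = eps * np.eye(d) + sum(np.outer(v, v) for v in rng.standard_normal((10, d)))
g_T = rng.standard_normal(d)
G_T = G_prev + np.outer(g_T, g_T)

diff = np.sqrt(np.diag(G_T)) - np.sqrt(np.diag(G_prev))
assert np.all(diff >= 0)
```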

Last but not least, let us show that $\mathrm{DiagAdaGrad}_X = \mathrm{AdaReg}_X^{\Phi}$ for any $\varepsilon > 0$ and any closed and convex set $\emptyset \neq X \subseteq \mathbb{R}^d$ (recall that $\eta > 0$ is already given by the statement). Let $X \subseteq \mathbb{R}^d$ be a closed, convex, and nonempty set, let $\varepsilon > 0$, and let $f := \langle f_1, \ldots, f_T\rangle \in \operatorname{Seq}((-\infty,+\infty]^{\mathbb{R}^d})$ be such that the function $f_t$ is subdifferentiable on $X$ for every $t \in [T]$. Let us show that $\mathrm{DiagAdaGrad}_X(f_{1:t-1}) = \mathrm{AdaReg}_X^{\Phi}(f_{1:t-1})$ by induction on $t \in [T]$. Set $x_1 := \mathrm{AdaReg}_X^{\Phi}(\langle\rangle)$ and let $H_1$ be as in the definition of $\mathrm{AdaReg}_X^{\Phi}(\langle\rangle)$. By (6.15), we know that $H_1 = (\eta/\sqrt{\varepsilon})I$. Thus,
\[
  x_1 \in \operatorname*{arg\,min}_{x \in X} \lVert x\rVert_{H_1^{-1}}
      = \operatorname*{arg\,min}_{x \in X} x^{\mathsf{T}} H_1^{-1} x
      = \operatorname*{arg\,min}_{x \in X} \frac{\sqrt{\varepsilon}}{\eta}\, x^{\mathsf{T}} x
      = \operatorname*{arg\,min}_{x \in X} \lVert x\rVert_2.
\]

Since $\lVert\cdot\rVert_2^2$ is strictly convex (see Lemma 3.9.5), the above minimizer is unique and, thus, we have $x_1 = \mathrm{DiagAdaGrad}_X(\langle\rangle)$. Let $t \in \{2, \ldots, T+1\}$, and let $g_{t-1} \in \mathbb{R}^d$ and $G_{t-1} \in \mathbb{S}^d_{++}$ be as in the definition of $x_t := \mathrm{AdaReg}_X^{\Phi}(f_{1:t-1})$. One may note that $g_1, \ldots, g_{t-1} \in \mathbb{R}^d$ as in the definition of $\mathrm{DiagAdaGrad}_X(f_{1:t-1})$ match $g_1, \ldots, g_{t-1}$ as in the definition of $\mathrm{AdaReg}_X^{\Phi}(f_{1:t-1})$ with a proper choice of well-order on the subdifferentials used by the oracles. In this case, by defining $\tilde{G}_{t-1} := \varepsilon I + \sum_{i=1}^{t-1}\operatorname{Diag}(g_i g_i^{\mathsf{T}})$ as in the definition of $\mathrm{DiagAdaGrad}_X(f_{1:t-1})$, we have $\tilde{G}_{t-1} = \operatorname{Diag}(G_{t-1})$. Finally, let $H_t$ be as in the definition of $\mathrm{AdaReg}_X^{\Phi}(f_{1:t-1})$, set $x_{t-1} := \mathrm{AdaReg}_X^{\Phi}(f_{1:t-2}) = \mathrm{DiagAdaGrad}_X(f_{1:t-2})$ (where the equation holds by the induction hypothesis), and define $\tilde{G} := \operatorname{Diag}(G_{t-1}) = \tilde{G}_{t-1}$. Then,
\[
  x_t = \Pi_X^{H_t^{-1}}\bigl(x_{t-1} - H_t g_{t-1}\bigr)
      \overset{(6.12)}{=} \Pi_X^{\tilde{G}^{1/2}}\bigl(x_{t-1} - \eta\tilde{G}^{-1/2} g_{t-1}\bigr)
      = \mathrm{DiagAdaGrad}_X(f_{1:t-1}).
\]
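The update just derived is easy to implement since all matrices involved are diagonal. The sketch below (illustrative only, not the thesis' formal oracle definitions; the function name and the box-shaped $X$ are assumptions) stores the diagonal of $\tilde{G}$ as a vector; for a box, the projection in a norm induced by a diagonal matrix is simply coordinate-wise clipping, because the squared norm is separable.

```python
import numpy as np

def diag_adagrad_step(x, g, G_diag, eta, lo, hi):
    """One DiagAdaGrad round on the box X = [lo, hi]^d (hypothetical helper)."""
    G_diag = G_diag + g ** 2                 # G_tilde_t = G_tilde_{t-1} + Diag(g g^T)
    y = x - eta * g / np.sqrt(G_diag)        # x_t - eta * G_tilde_t^{-1/2} g
    return np.clip(y, lo, hi), G_diag        # projection onto the box

# Toy run on f_t(x) = ||x - z_t||_2^2 / 2 over X = [-1, 1]^d.
rng = np.random.default_rng(2)
d, T, eps, eta = 3, 50, 1e-3, 1.0
x, G_diag = np.zeros(d), eps * np.ones(d)    # G_tilde_0 = eps * I
for t in range(T):
    z = rng.uniform(-1, 1, size=d)
    g = x - z                                # gradient of f_t at the current point
    x, G_diag = diag_adagrad_step(x, g, G_diag, eta, -1.0, 1.0)
```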

Finally, we are in a position to prove a regret bound for the Diagonal AdaGrad algorithm.

Theorem 6.4.3. Let $\mathcal{C} := (X, \mathcal{F})$ be an OCO instance such that $X$ is a nonempty closed set and such that each $f \in \mathcal{F}$ is a proper closed function which is subdifferentiable on $X$. Let $\varepsilon > 0$, let $T \in \mathbb{N}$, and let $\mathrm{ENEMY}$ be an enemy oracle for $\mathcal{C}$. Suppose there is $\theta \in \mathbb{R}_{++}$ such that $\theta \geq \sup\{\,\lVert u - x\rVert_\infty^2 : u, x \in X\,\}$ and set $\eta := \sqrt{\theta/2}$ for $\mathrm{DiagAdaGrad}_X$. Finally, define
\[
  (x, f) := \mathrm{OCO}_{\mathcal{C}}(\mathrm{DiagAdaGrad}_X, \mathrm{ENEMY}, T)
\]
and let $\tilde{G}_T \in \mathbb{S}^d_{++}$ be as in the definition of $\mathrm{DiagAdaGrad}_X(f)$. Then
\[
  \mathrm{Regret}(\mathrm{DiagAdaGrad}_X, f, X) \leq \sqrt{2\theta}\,\operatorname{Tr}\bigl(\tilde{G}_T^{1/2}\bigr).
\]

Proof. Let $\Phi$ be the meta-regularizer from Lemma 6.4.2 for the given $\eta$, so that $\Phi(H) = \eta^2\operatorname{Tr}(H^{-1}) + \delta(H \mid \mathbb{D}^d)$ for every $H \in \mathbb{S}^d_{++}$ and, by Lemma 6.4.2, $\mathrm{AdaReg}_X^{\Phi} = \mathrm{DiagAdaGrad}_X$ if a proper well-order is equipped to the sets used by $\mathrm{AdaReg}_X^{\Phi}$. In this case, it suffices to bound the regret of $\mathrm{AdaReg}_X^{\Phi}$. For each $t \in \{1, \ldots, T+1\}$, let $H_t, G_{t-1} \in \mathbb{S}^d_{++}$ be as in the definition of $\mathrm{AdaReg}_X^{\Phi}(f)$, set $\tilde{G}_{t-1} := \operatorname{Diag}(G_{t-1})$ (which matches the definition of $\tilde{G}_{t-1}$ in $\mathrm{DiagAdaGrad}_X$), define $D_t := H_t^{-1} - [t > 1]\,H_{t-1}^{-1}$, and let $u \in X$. Thus, by Theorem 6.2.4 with $x_0 := x_1$ we have

\[
  \mathrm{Regret}(\mathrm{AdaReg}_X^{\Phi}, f, u)
  \leq \frac{1}{2}\sum_{t=0}^{T}\lVert u - x_t\rVert_{D_{t+1}}^2
  + \frac{1}{2}\min_{H \in \mathbb{S}^d_{++}}\bigl(\langle G_T, H\rangle + \Phi(H) - \Phi(H_1)\bigr). \tag{6.16}
\]
Let us bound each of the terms on the right-hand side separately. First, let us show that

\[
  \sum_{t=0}^{T}\lVert u - x_t\rVert_{D_{t+1}}^2 \leq \sqrt{2\theta}\,\operatorname{Tr}\bigl(\tilde{G}_T^{1/2}\bigr). \tag{6.17}
\]
By the definition of the matrices $H_1, \ldots, H_{T+1}$ and by Lemma 6.4.2 we have $H_t = \eta\tilde{G}_{t-1}^{-1/2}$ for every $t \in \{1, \ldots, T+1\}$. Thus, $H_t$ and $D_t$ are diagonal matrices with nonnegative diagonal entries (the diagonal of $H_t$ is even positive since $G_{t-1} \succ 0$, and the diagonal of $D_t$ is nonnegative since $\tilde{G}_{t-1} \succeq \tilde{G}_{t-2}$ when $t > 1$). Thus, for every $t \in \{1, \ldots, T+1\}$ and any $v \in \mathbb{R}^d$,

\[
  v^{\mathsf{T}} D_{t+1} v = \sum_{i=1}^{d} v_i^2 (D_{t+1})_{i,i}
  \leq \lVert v\rVert_\infty^2 \sum_{i=1}^{d} (D_{t+1})_{i,i}
  = \lVert v\rVert_\infty^2 \operatorname{Tr}(D_{t+1}). \tag{6.18}
\]
Therefore, using, among other facts, that $H_{T+1} = \eta\tilde{G}_T^{-1/2}$ by Lemma 6.4.2, we have

\begin{align*}
  \sum_{t=0}^{T}\lVert u - x_t\rVert_{D_{t+1}}^2
  &= \sum_{t=0}^{T}(u - x_t)^{\mathsf{T}} D_{t+1}(u - x_t)
  \overset{(6.18)}{\leq} \sum_{t=0}^{T}\lVert u - x_t\rVert_\infty^2\operatorname{Tr}(D_{t+1})\\
  &\leq \theta\sum_{t=0}^{T}\operatorname{Tr}(D_{t+1})
  = \theta\operatorname{Tr}\Bigl(\sum_{t=0}^{T} D_{t+1}\Bigr)
  = \theta\operatorname{Tr}\bigl(H_{T+1}^{-1}\bigr)
  \overset{\text{Lem.~6.4.2}}{=} \frac{\theta}{\eta}\operatorname{Tr}\bigl(\tilde{G}_T^{1/2}\bigr)
  = \sqrt{2\theta}\,\operatorname{Tr}\bigl(\tilde{G}_T^{1/2}\bigr).
\end{align*}

This proves (6.17). Let us now show that
\[
  \min_{H \in \mathbb{S}^d_{++}}\bigl(\langle\tilde{G}_T, H\rangle + \Phi(H) - \Phi(H_1)\bigr)
  \leq \sqrt{2\theta}\,\operatorname{Tr}\bigl(\tilde{G}_T^{1/2}\bigr). \tag{6.19}
\]
Note that the minimum in (6.19) coincides with the one in (6.16), since $\Phi$ is infinite outside of $\mathbb{D}^d$ and $\langle G_T, H\rangle = \langle\operatorname{Diag}(G_T), H\rangle = \langle\tilde{G}_T, H\rangle$ for every $H \in \mathbb{D}^d$. By Lemma 6.4.2, the above minimum is attained by $\eta\tilde{G}_T^{-1/2}$, and $H_1 = (\eta/\sqrt{\varepsilon})I$ since, by definition, $H_1 \in \operatorname*{arg\,min}_{H \in \mathbb{S}^d_{++}}(\langle\tilde{G}_0, H\rangle + \Phi(H))$ and $\tilde{G}_0 = \varepsilon I$. Therefore,
\begin{align*}
  \min_{H \in \mathbb{S}^d_{++}}\bigl(\langle\tilde{G}_T, H\rangle + \Phi(H) - \Phi(H_1)\bigr)
  &= \eta\operatorname{Tr}\bigl(\tilde{G}_T^{1/2}\bigr) + \Phi\bigl(\eta\tilde{G}_T^{-1/2}\bigr) - \Phi\bigl((\eta/\sqrt{\varepsilon})I\bigr)\\
  &= \eta\operatorname{Tr}\bigl(\tilde{G}_T^{1/2}\bigr) + \eta\operatorname{Tr}\bigl(\tilde{G}_T^{1/2}\bigr) - \eta\sqrt{\varepsilon}\operatorname{Tr}(I)\\
  &\leq 2\eta\operatorname{Tr}\bigl(\tilde{G}_T^{1/2}\bigr) = \sqrt{2\theta}\,\operatorname{Tr}\bigl(\tilde{G}_T^{1/2}\bigr),
\end{align*}
which proves (6.19). Plugging (6.17) and (6.19) into the regret bound from (6.16) completes the proof of the statement.
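As an illustration (a sanity check, not a proof), the sketch below runs DiagAdaGrad on random linear losses $f_t(x) = g_t^{\mathsf{T}} x$ over the box $X = [-1,1]^d$, for which $\theta = 4 \geq \sup_{u,x\in X}\lVert u - x\rVert_\infty^2$, and compares the realized regret against $\sqrt{2\theta}\operatorname{Tr}(\tilde{G}_T^{1/2})$. All concrete values are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
d, T, eps = 5, 200, 1e-6
theta = 4.0
eta = np.sqrt(theta / 2.0)

x, G_diag = np.zeros(d), eps * np.ones(d)
loss, grad_sum = 0.0, np.zeros(d)
for t in range(T):
    g = rng.uniform(-1, 1, size=d)                   # enemy picks f_t(x) = g^T x
    loss += g @ x
    grad_sum += g
    G_diag = G_diag + g ** 2                         # diagonal of G_tilde_t
    x = np.clip(x - eta * g / np.sqrt(G_diag), -1.0, 1.0)

best_fixed = -np.abs(grad_sum).sum()                 # min over the box of sum_t g_t^T u
regret = loss - best_fixed
bound = np.sqrt(2 * theta) * np.sqrt(G_diag).sum()   # sqrt(2 theta) * Tr(G_tilde_T^{1/2})
assert regret <= bound
```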

Again, one may find it hard to attach any meaning to the trace in the regret bound of the above theorem. Since the matrices used by $\mathrm{DiagAdaGrad}_X$ are diagonal, there is a simpler formula for this trace. Namely, let $\varepsilon > 0$, let $g \in (\mathbb{R}^d)^T$ for some $T \in \mathbb{N}$, and define $\tilde{G}_T := \varepsilon I + \sum_{t=1}^{T}\operatorname{Diag}(g_t g_t^{\mathsf{T}})$. Then, one may verify that
\[
  \operatorname{Tr}\bigl(\tilde{G}_T^{1/2}\bigr) = \sum_{i=1}^{d}\sqrt{\varepsilon + \sum_{t=1}^{T} g_t(i)^2}.
\]
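A quick numerical check of this formula (a sketch with arbitrary data), computing the matrix square root through an eigendecomposition rather than coordinate-wise so that the identity is actually exercised:

```python
import numpy as np

rng = np.random.default_rng(4)
d, T, eps = 4, 30, 0.5
g = rng.standard_normal((T, d))

G_tilde = eps * np.eye(d) + np.diag((g ** 2).sum(axis=0))
w, V = np.linalg.eigh(G_tilde)                       # G_tilde^{1/2} via eigendecomposition
lhs = np.trace(V @ np.diag(np.sqrt(w)) @ V.T)
rhs = np.sqrt(eps + (g ** 2).sum(axis=0)).sum()
assert np.isclose(lhs, rhs)
```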

Still, this formula may not be very informative. Let us prove a proposition, analogous to its counterpart from the previous section, which sheds some light on the meaning of the above trace.

Proposition 6.4.4. Let $A \in \mathbb{S}^d_{++}\cap\mathbb{D}^d$. Then $\inf\{\operatorname{Tr}(X^{-1}A) : X \in \mathbb{S}^d_{++}\cap\mathbb{D}^d,\ \operatorname{Tr}(X) = 1\}$ is attained by $\operatorname{Tr}(A^{1/2})^{-1}A^{1/2}$.

Proof. Define $a := \operatorname{diag}(A)$. Since $A \in \mathbb{D}^d$, we have $A = \operatorname{Diag}(a)$. Additionally, note that
\begin{align*}
  \inf\{\,\langle X^{-1}, A\rangle : X \in \mathbb{S}^d_{++}\cap\mathbb{D}^d,\ \operatorname{Tr}(X) = 1\,\}
  &= \inf\{\,\langle\operatorname{Diag}(x)^{-1}, \operatorname{Diag}(a)\rangle : x \in \mathbb{R}^d_{++}\cap\Delta_d\,\}\\
  &= \inf\Bigl\{\,\sum_{i=1}^{d}\frac{a_i}{x_i} : x \in \mathbb{R}^d_{++}\cap\Delta_d\,\Bigr\}.
\end{align*}

Not only that, we also have that $\bar{x} \in \mathbb{R}^d_{++}\cap\Delta_d$ attains the last infimum above if and only if $X := \operatorname{Diag}(\bar{x})$ attains the first infimum above. Define $\bar{x} \in \Delta_d$ by
\[
  \bar{x}_i := \frac{a_i^{1/2}}{\sum_{j=1}^{d} a_j^{1/2}}, \qquad \forall i \in [d].
\]
Note that
\[
  \operatorname{Diag}(\bar{x})
  = \frac{1}{\sum_{j=1}^{d} a_j^{1/2}}\operatorname{Diag}(a)^{1/2}
  = \frac{1}{\operatorname{Tr}(\operatorname{Diag}(a)^{1/2})}\operatorname{Diag}(a)^{1/2}
  = \frac{1}{\operatorname{Tr}(A^{1/2})}A^{1/2}.
\]
Thus, to prove the statement, it suffices to show that
\[
  \bar{x} \in \operatorname*{arg\,min}\Bigl\{\,\sum_{i=1}^{d}\frac{a_i}{x_i} : x \in \mathbb{R}^d_{++}\cap\Delta_d\,\Bigr\}. \tag{6.20}
\]

Define the convex function $c \colon \mathbb{R}^d \to (-\infty,+\infty]$ by
\[
  c(x) := \sum_{i=1}^{d}[x_i > 0]\,\frac{a_i}{x_i} + \delta(x \mid \mathbb{R}^d_{++}), \qquad \forall x \in \mathbb{R}^d.
\]
First of all, note that $c$ is closed. Indeed, $c$ is continuous on $\mathbb{R}^d_{++}$ and, for every $y \in \mathbb{R}^d_{+}\setminus\mathbb{R}^d_{++}$,
\[
  \liminf_{x \to y} c(x) = +\infty = c(y).
\]

Moreover, note that
\[
  (\nabla c(x))_i = -\frac{a_i}{x_i^2}, \qquad \forall i \in [d],\ \forall x \in \mathbb{R}^d_{++}.
\]
Thus, for every $x \in \Delta_d$,
\[
  \nabla c(\bar{x})^{\mathsf{T}}(x - \bar{x})
  = -\Bigl(\sum_{i=1}^{d} a_i^{1/2}\Bigr)^{2}\mathbf{1}^{\mathsf{T}}(x - \bar{x}) = 0.
\]
That is, $-\nabla c(\bar{x}) \in N_{\Delta_d}(\bar{x})$. By the optimality conditions for minima of convex functions (see Theorem 3.6.2), this implies that $\bar{x} \in \operatorname*{arg\,min}_{x \in \Delta_d} c(x)$, which is equivalent to (6.20).
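The sketch below (arbitrary data) illustrates Proposition 6.4.4 numerically for a diagonal $A$: the candidate $\bar{X} = A^{1/2}/\operatorname{Tr}(A^{1/2})$ attains the value $\operatorname{Tr}(A^{1/2})^2$, and random feasible diagonal matrices with unit trace never do better.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 6
a = rng.uniform(0.1, 2.0, size=d)                # diagonal of A in S^d_{++} cap D^d

x_bar = np.sqrt(a) / np.sqrt(a).sum()            # diagonal of A^{1/2} / Tr(A^{1/2})
val_bar = np.sum(a / x_bar)                      # Tr(X_bar^{-1} A)
assert np.isclose(val_bar, np.sqrt(a).sum() ** 2)

for _ in range(200):
    x = rng.uniform(0.01, 1.0, size=d)
    x /= x.sum()                                 # a random point of the simplex
    assert np.sum(a / x) >= val_bar - 1e-9
```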

Corollary 6.4.5. Let $\varepsilon > 0$ and $g \in (\mathbb{R}^d)^T$ for some $T \in \mathbb{N}$. Moreover, set $\tilde{G}_T := \varepsilon I + \sum_{t=1}^{T}\operatorname{Diag}(g_t g_t^{\mathsf{T}})$ and $\mathcal{S}_d := \{X \in \mathbb{S}^d_{++} : \operatorname{Tr}(X) = 1\}$. Then
\[
  \operatorname{Tr}\bigl(\tilde{G}_T^{1/2}\bigr)
  = \sqrt{\min_{H \in \mathcal{S}_d\cap\mathbb{D}^d}\Bigl(\varepsilon\operatorname{Tr}(H^{-1}) + \sum_{t=1}^{T}\lVert g_t\rVert_{H^{-1}}^2\Bigr)}.
\]

Proof. Set $\mathcal{H} := \mathcal{S}_d\cap\mathbb{D}^d$. Note that if $g \in \mathbb{R}^d$ and $H \in \mathbb{D}^d$, then $\lVert g\rVert_H^2 = \operatorname{Tr}(\operatorname{Diag}(g)H\operatorname{Diag}(g))$. Using this fact and Proposition 6.4.4, we have
\begin{align*}
  \sqrt{\min_{H \in \mathcal{H}}\Bigl(\varepsilon\operatorname{Tr}(H^{-1}) + \sum_{t=1}^{T}\lVert g_t\rVert_{H^{-1}}^2\Bigr)}
  &= \sqrt{\min_{H \in \mathcal{H}}\Bigl(\varepsilon\langle H^{-1}, I\rangle + \sum_{t=1}^{T}\operatorname{Tr}\bigl(\operatorname{Diag}(g_t)H^{-1}\operatorname{Diag}(g_t)\bigr)\Bigr)}\\
  &= \sqrt{\min_{H \in \mathcal{H}}\Bigl(\varepsilon\langle H^{-1}, I\rangle + \sum_{t=1}^{T}\bigl\langle H^{-1}, \operatorname{Diag}(g_t g_t^{\mathsf{T}})\bigr\rangle\Bigr)}\\
  &= \sqrt{\min_{H \in \mathcal{H}}\langle H^{-1}, \tilde{G}_T\rangle}
  = \sqrt{\operatorname{Tr}\bigl(\tilde{G}_T^{1/2}\bigr)^2}
  = \operatorname{Tr}\bigl(\tilde{G}_T^{1/2}\bigr).
\end{align*}
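A short numerical check of the corollary (a sketch with arbitrary data): evaluating the objective at the minimizer $H = \tilde{G}_T^{1/2}/\operatorname{Tr}(\tilde{G}_T^{1/2})$ given by Proposition 6.4.4 recovers $\operatorname{Tr}(\tilde{G}_T^{1/2})^2$, whose square root is the trace in the regret bound.

```python
import numpy as np

rng = np.random.default_rng(6)
d, T, eps = 4, 25, 0.3
g = rng.standard_normal((T, d))

G_diag = eps + (g ** 2).sum(axis=0)              # diagonal of G_tilde_T
h = np.sqrt(G_diag) / np.sqrt(G_diag).sum()      # diagonal of the minimizing H
objective = eps * np.sum(1.0 / h) + sum(gt @ (gt / h) for gt in g)
assert np.isclose(np.sqrt(objective), np.sqrt(G_diag).sum())
```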

With Corollary 6.4.5 at hand, we can compare the regret bound for Diagonal AdaGrad from Theorem 6.4.3 with the regret bound for the classic AdaGrad algorithm (Theorem 6.3.2) and with the regret bound for the Online Mirror Descent algorithm with adaptive step sizes (Theorem 6.1.1). As expected, the regret bound for AdaGrad seems to be better than the one for its diagonal version. We can see this by comparing Corollaries 6.3.4 and 6.4.5, which give more palatable ways of writing the traces that appear in the bounds of both algorithms: in Corollary 6.3.4 the minimum is taken over all positive definite matrices in the spectraplex, while in the above corollary the search space is restricted to diagonal matrices.

Still, the regret bound for Diagonal AdaGrad seems to be as good as the one for the OMD algorithm with adaptive step sizes from Theorem 6.1.1. To see this, recall from (6.13) that the (scaled) $\ell_2$-norm can be written as the norm induced by $d^{-1}I$, where $d \in \mathbb{N}\setminus\{0\}$ is the dimension of the problem. Thus, for a value of $\varepsilon > 0$ small enough in the above corollary, we conclude that the norm chosen by the above minimum is at least as good as the $\ell_2$-norm if the goal is to minimize the sum of the squared norms of the subgradients. However, one problem appears when trying to compare the regret bound for $\mathrm{DiagAdaGrad}_X$ with the bounds from previous sections: the diameter $\theta \in \mathbb{R}_{++}$ in Theorem 6.4.3 is measured with respect to the $\ell_\infty$-norm, while in previous sections the $\ell_2$-norm was used. Thus, more informative comparisons of these regret bounds require more information on the set $X \subseteq \mathbb{R}^d$ from which the player picks her points.