
OMD Connection with FTRL and Regret Bounds

Proof. Let us prove the statement by induction on t ∈ [T]. For t = 1 the statement holds by the definitions of x_1 and x′_1. Let t ∈ {1, . . . , T − 1} and suppose x_t = x′_t. Define

y_{t+1} := ∇R_{t+1}(x_t) − g_t = ∇R_{t+1}(x′_t) − g_t.

By the definition of AdaOMD^X_R on Algorithm 5.1, we have that x_{t+1} = Π^{R_{t+1}}_X(∇R*_{t+1}(y_{t+1})).

Proposition 5.1.3 states that, since R_{t+1} is a mirror map for X, the conjugate R*_{t+1} is differentiable on E and R_{t+1} is differentiable at ∇R*_{t+1}(y_{t+1}) ∈ int(dom R_{t+1}). Thus, since int(dom R_{t+1}) ∩ ri(X) ≠ ∅ by the definition of mirror map, Lemma 3.11.4 yields

x_{t+1} = Π^{R_{t+1}}_X(∇R*_{t+1}(y_{t+1})) ⟺ y_{t+1} − ∇R_{t+1}(x_{t+1}) ∈ N_X(x_{t+1}). (5.7)

Note that

y_{t+1} − ∇R_{t+1}(x_{t+1}) = ∇R_{t+1}(x′_t) − g_t − ∇R_{t+1}(x_{t+1}) = −g_t − (∇B_{R_{t+1}}(·, x′_t))(x_{t+1}),

and the latter is minus the gradient of x ∈ E ↦ ⟨g_t, x⟩ + B_{R_{t+1}}(x, x′_t) at x_{t+1}. Therefore, using (5.7) with the above equation and by the optimality conditions from Theorem 3.6.2,

x_{t+1} = Π^{R_{t+1}}_X(∇R*_{t+1}(y_{t+1})) ⟺ −(g_t + (∇B_{R_{t+1}}(·, x′_t))(x_{t+1})) ∈ N_X(x_{t+1})
⟺ x_{t+1} ∈ arg min_{x∈X} (⟨g_t, x⟩ + B_{R_{t+1}}(x, x′_t)).

This connection between Adaptive OMD and generalized proximal operators gives us yet another intuitive view of online mirror descent. At round t, the algorithm minimizes, as much as possible, the linearized version of the function f_{t−1} played by the enemy on the last round, while trying not to pick a point too far from the last iterate w.r.t. the Bregman divergence induced by the current mirror map. In the next section, we will see that Adaptive OMD can be seen as an application of the Adaptive FTRL algorithm.
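As a concrete sketch of this proximal view (an illustration not from the text, assuming the standard entropic setup over the probability simplex), the update arg min_{x∈X} ⟨g_t, x⟩ + B_R(x, x_t) has a closed form for the negative-entropy mirror map R(x) = Σ_i x_i ln x_i:

```python
import math

# A minimal sketch (not from the text): with the negative-entropy mirror map
# R(x) = sum_i x_i*log(x_i) on the probability simplex, the proximal update
#     x_{t+1} = argmin_{x in simplex} <g_t, x> + B_R(x, x_t)
# has the closed form x_{t+1} ∝ x_t * exp(-g_t)  (exponentiated gradient).

def omd_entropic_step(x, g):
    w = [xi * math.exp(-gi) for xi, gi in zip(x, g)]
    z = sum(w)
    return [wi / z for wi in w]   # renormalize back onto the simplex

# one step from the uniform point, with a loss gradient penalizing actions 1 and 2
x = omd_entropic_step([1 / 3, 1 / 3, 1 / 3], [0.0, 1.0, 1.0])
```

The step trades off following −g_t against staying close to the previous iterate in the entropic Bregman divergence, exactly the balance described above.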

Proposition 5.4.1. Let X ⊆ E be a closed convex set and let R : E → (−∞, +∞] be a mirror map for X such that R is strongly convex on X. Finally, define R_X := R + δ(· | X). Then, for every x ∈ E,

∇R*_X(x) = Π^R_X(∇R*(x)).

Proof. Note that the functions R* and R*_X are both differentiable everywhere, the former⁶ by the properties of mirror maps given by Proposition 5.1.3 and the latter by Proposition 3.9.8 since R_X is actually strongly convex on E. Moreover, since int(dom R) ∩ ri(X) is nonempty and since ∂δ(· | X)(z) = N_X(z) by Lemma 3.5.3, we have by Theorems 3.5.4 and 3.5.5 that, for every z ∈ int(dom R) ∩ X,

N_X(z) + ∇R(z) = ∂(δ(· | X) + R)(z) = ∂R_X(z). (5.8)

Finally, by (5.1.ii) from the mirror map definition, Π^R_X(∇R*(x)) ∈ int(dom R) ∩ X. Therefore, for any x ∈ E and z ∈ int(dom R) ∩ X,

z = Π^R_X(∇R*(x)) ⟺ x − ∇R(z) ∈ N_X(z)  by Lemma 3.11.4,
⟺ x ∈ ∂R_X(z)  by (5.8),
⟺ z ∈ ∂R*_X(x) = {∇R*_X(x)}  by Theorems 3.5.2 and 3.5.5.
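For intuition, the proposition can be checked numerically in the simplest Euclidean case (a hypothetical toy example, not from the text): with R = ½‖·‖² we have ∇R* = id and Π^R_X is the ordinary Euclidean projection, while ∇R*_X(x) is the maximizer of z ↦ ⟨x, z⟩ − R(z) over X:

```python
# Hypothetical toy check of Proposition 5.4.1: R(z) = z^2/2 and X = [-1, 1].
# Then R*_X(x) = sup_{z in X} (x*z - z^2/2), and the proposition says its
# gradient (the maximizer below) equals the projection of grad R*(x) = x onto X.

def proj(x):                                  # Euclidean projection onto [-1, 1]
    return max(-1.0, min(1.0, x))

grid = [i / 10000.0 for i in range(-10000, 10001)]   # fine grid over X

for x in [-3.0, -0.4, 0.0, 0.7, 2.5]:
    # brute-force argmax defining grad R*_X(x)
    maximizer = max(grid, key=lambda z: x * z - z * z / 2)
    assert abs(maximizer - proj(x)) < 1e-3
```

For |x| ≤ 1 the maximizer is x itself; outside, the linear term pins the maximizer to the nearest endpoint, which is exactly the projection.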

The above proposition hints at the possibility of passing the restriction to the set X on to the OMD oracles through an indicator function added to the mirror map, making the method more similar to an FTRL algorithm. It only remains to look at the updates⁷ of the type

y_{t+1} = ∇R(x_t) − g_t (5.9)

from Algorithm 5.2, where R is a mirror map for a set X from an OCO instance C := (X, F).

However, at this point the universe stops its acts of kindness. If we define R_X := R + δ(· | X), then, assuming that int(dom R) ∩ ri(X) is nonempty, by Theorem 3.5.4 we have ∂R_X(x) = ∇R(x) + N_X(x) for x ∈ dom R. Since 0 is in the normal cone of X at any point of X, we know that ∇R(x) is a subgradient of R_X at x, and thus EOMD might work as usual with such a mirror map if we use a subgradient of the mirror map instead of its gradient in (5.9). However, without explicit knowledge of the original mirror map R and of the indicator function of X, there is no way to know the “correct” subgradient of R_X to pick.

After the above discussion, one may guess that some connections between Adaptive OMD and Adaptive FTRL will involve points from the normal cone of X. The following theorem shows one connection in which this is exactly what happens.

Theorem 5.4.2. Let C := (X, F) be an OCO instance such that X is closed and such that each f ∈ F is a proper closed function which is subdifferentiable on X. Let R : Seq(F) → (−∞, +∞]^E be a mirror map strategy for C, let ENEMY be an enemy oracle for C, and let T ∈ N. Define

(x, f) := OCO_C(AdaOMD^X_R, ENEMY, T)

and r_t := R(⟨f_1, . . . , f_{t−1}⟩) for each t ∈ {1, . . . , T + 1}.

⁶At first sight, one may think that the differentiability of R* is a consequence of Proposition 3.9.8 as well. Note, however, that R is strongly convex only on a subset X of E, and Proposition 3.9.8 only applies to functions which are strongly convex on the whole Euclidean space. For FTRL regularizers this was never an issue since we already needed to add the indicator function of the set X ⊆ R^d where the player could pick her points and on which, usually, the regularizer was strongly convex.

⁷We will look at the EOMD oracle for the sake of simplicity, but the same discussion holds for the AdaOMD oracle.

Moreover, let g_t ∈ ∂f_t(x_t) be the same as in the definition of AdaOMD^X_R(f_{1:t}) on Algorithm 5.1 for each t ∈ [T]. Then, there is p_t ∈ N_X(x_t) for each t ∈ [T] such that, for every t ∈ [T], both infima

inf_{x∈X} ( Σ_{i=1}^{t−1} ⟨g_i, x⟩ + r_1(x) + Σ_{i=1}^{t−1} (B_{r_{i+1}}(x, x_i) + ⟨p_i, x⟩) ) (5.10)

and

inf_{x∈E} ( Σ_{i=1}^{t−1} ⟨g_i, x⟩ + r_1(x) + ⟨p_t, x⟩ + Σ_{i=1}^{t−1} (B_{r_{i+1}}(x, x_i) + ⟨p_i, x⟩) ) (5.11)

are attained only by x_t. In particular, if we define R′ : Seq(F) → (−∞, +∞]^E by

R′(⟨⟩) := r_1 + ⟨p_1, ·⟩ + δ(· | X)

and by

R′(f) := B_{r_{t+1}}(·, x_t) + ⟨p_{t+1}, ·⟩, ∀f ∈ F^t, ∀t ∈ {1, . . . , T − 1},

and define h_t := ⟨g_t, ·⟩ for each t ∈ [T], then AdaOMD^X_R(f_{1:t−1}) = AdaFTRL_{R′}(h_{1:t−1}) for each round t ∈ [T].

Proof. Set x_{T+1} := AdaOMD^X_R(f) and define p_1, . . . , p_T ∈ E, in order, by

p_t := −( Σ_{i=1}^{t−1} g_i + ∇r_1(x_t) + Σ_{i=1}^{t−1} (∇r_{i+1}(x_t) − ∇r_{i+1}(x_i) + p_i) ), ∀t ∈ [T].

First, let us show that

(5.12) p_t ∈ N_X(x_t) for each t ∈ [T], and x_t is the unique point that attains the infimum in (5.10) for each t ∈ {1, . . . , T + 1}.

Let us prove (5.12) by induction on T ∈ N. For T = 0, we have that (5.10) for t = 1 is inf_{x∈X} r_1(x). By (5.1.ii) the latter infimum is attained, and x_1 ∈ arg min_{x∈X} r_1(x) by the definition of the AdaOMD^X_R oracle. Let T ∈ N \ {0}. By the induction hypothesis, (5.12) holds for T − 1, that is, the points p_1, . . . , p_{T−1} lie in the normal cones as described in (5.12). Thus, we have that

{x_T} = arg min_{x∈X} ( Σ_{t=1}^{T−1} ⟨g_t, x⟩ + r_1(x) + Σ_{t=1}^{T−1} (B_{r_{t+1}}(x, x_t) + ⟨p_t, x⟩) ).

Define H(x) := Σ_{t=1}^{T−1} ⟨g_t, x⟩ + r_1(x) + Σ_{t=1}^{T−1} (B_{r_{t+1}}(x, x_t) + ⟨p_t, x⟩) for every x ∈ E. Set D := int(dom r_T) (which is nonempty and is such that D = int(dom r_t) for t ∈ [T] by the definition of mirror map strategy). One can easily see that int(dom H) = D since H is the sum of Bregman divergences w.r.t. the regularizer increments and linear functions. Since ri(X) ∩ D ≠ ∅ (by the definition of mirror map strategy) we can use the optimality conditions from Theorem 3.6.2.

That is, x_T attains inf_{x∈X} H(x) if and only if (−∂H(x_T)) ∩ N_X(x_T) is nonempty. Since x_T ∈ D by the guarantees of the AdaOMD^X_R oracle, we have that r_t is differentiable at x_T for each t ∈ [T]. Therefore, H is differentiable at x_T, and by Theorem 3.5.5 we have ∂H(x_T) = {∇H(x_T)}. Therefore, x_T attains inf_{x∈X} H(x) if and only if

−( Σ_{t=1}^{T−1} g_t + ∇r_1(x_T) + Σ_{t=1}^{T−1} (∇r_{t+1}(x_T) − ∇r_{t+1}(x_t) + p_t) ) = −∇H(x_T) ∈ N_X(x_T).

Since the left-hand side of the above equation is exactly p_T from (5.4), we conclude that p_T ∈ N_X(x_T).

It only remains to show that x_{T+1} is the unique point that attains the infimum in (5.10) with t = T + 1. To see this, first note that

Σ_{t=1}^{T−1} g_t + ∇r_1(x_T) + p_T + Σ_{t=1}^{T−1} (∇r_{t+1}(x_T) − ∇r_{t+1}(x_t) + p_t) = 0
⟺ Σ_{t=1}^{T−1} g_t + ∇r_1(x_T) + Σ_{t=1}^{T} (∇r_{t+1}(x_T) − ∇r_{t+1}(x_t) + p_t) = 0
⟺ Σ_{t=1}^{T+1} ∇r_t(x_T) = Σ_{t=1}^{T} ∇r_{t+1}(x_t) − Σ_{t=1}^{T−1} g_t − Σ_{t=1}^{T} p_t.

Define R_{T+1} := Σ_{t=1}^{T+1} r_t and s := Σ_{t=1}^{T} g_t + Σ_{t=1}^{T} p_t. Since ∇R_{T+1}(x_T) = Σ_{t=1}^{T+1} ∇r_t(x_T), we have

∇R_{T+1}(x_T) = Σ_{t=1}^{T} ∇r_{t+1}(x_t) − Σ_{t=1}^{T−1} g_t − Σ_{t=1}^{T} p_t = Σ_{t=1}^{T} ∇r_{t+1}(x_t) − s + g_T. (5.13)

Moreover, recall that by Theorem 5.3.2,

{x_{T+1}} = arg min_{x∈X} (⟨g_T, x⟩ + B_{R_{T+1}}(x, x_T)).

Finally, note that, for any x ∈ E,

⟨g_T, x⟩ + B_{R_{T+1}}(x, x_T) = ⟨g_T, x⟩ + R_{T+1}(x) − R_{T+1}(x_T) − ⟨∇R_{T+1}(x_T), x − x_T⟩
= ⟨g_T, x⟩ + R_{T+1}(x) − R_{T+1}(x_T) − Σ_{t=1}^{T} ⟨∇r_{t+1}(x_t), x − x_T⟩ + ⟨s, x − x_T⟩ − ⟨g_T, x − x_T⟩   (by (5.13))
= r_1(x) − r_1(x_T) + Σ_{t=1}^{T} (r_{t+1}(x) − r_{t+1}(x_T) − ⟨∇r_{t+1}(x_t), x − x_T⟩) + ⟨s, x − x_T⟩ + ⟨g_T, x_T⟩
= r_1(x) − r_1(x_T) + Σ_{t=1}^{T} (B_{r_{t+1}}(x, x_t) − r_{t+1}(x_T) + r_{t+1}(x_t) + ⟨∇r_{t+1}(x_t), x_T − x_t⟩) + ⟨s, x − x_T⟩ + ⟨g_T, x_T⟩.

Therefore, by ignoring the terms which do not depend on x in the above equation (since they do not affect which points attain the minimum for x ∈ X), we conclude that

{x_{T+1}} = arg min_{x∈X} (⟨g_T, x⟩ + B_{R_{T+1}}(x, x_T)) = arg min_{x∈X} ( r_1(x) + Σ_{t=1}^{T} B_{r_{t+1}}(x, x_t) + ⟨s, x⟩ ),

and the right-hand side of the above equation is exactly the set of points which attain (5.10) for t = T + 1 by the definition of s. This finishes the proof of (5.12).

Finally, let us show that x_t attains (5.11) for each t ∈ [T] using p_1, . . . , p_T ∈ E as in (5.4). Let t ∈ [T] and define

F(x) := ⟨p_t, x⟩ + Σ_{i=1}^{t−1} ⟨g_i, x⟩ + r_1(x) + Σ_{i=1}^{t−1} (B_{r_{i+1}}(x, x_i) + ⟨p_i, x⟩).

Let us show that x_t attains inf_{x∈E} F(x). Since x_t ∈ D, we have that r_i is differentiable at x_t for every i ∈ [t]. Hence, F is also differentiable at x_t. Thus, by the form of p_t from (5.12) we have

∇F(x_t) = p_t + Σ_{i=1}^{t−1} g_i + ∇r_1(x_t) + Σ_{i=1}^{t−1} (∇r_{i+1}(x_t) − ∇r_{i+1}(x_i) + p_i) = 0,

that is, ∇F(x_t) = 0. Since ∂F(x_t) = {∇F(x_t)} by Theorem 3.5.5, we have 0 ∈ ∂F(x_t), which happens if and only if x_t ∈ arg min_{x∈E} F(x). Moreover, since F is strictly convex (since r_1, . . . , r_t are strongly convex), we have {x_t} = arg min_{x∈E} F(x). That is, x_t is the unique point that attains inf_{x∈E} F(x).

In particular, if we define R′ and h ∈ (R^E)^T as in the statement, for every t ∈ [T] we have

{AdaFTRL_{R′}(h_{1:t−1})} = arg min_{x∈E} ( Σ_{i=1}^{t−1} h_i(x) + Σ_{i=1}^{t} R′(h_{1:i−1}) )
= arg min_{x∈X} ( Σ_{i=1}^{t−1} ⟨g_i, x⟩ + r_1(x) + ⟨p_1, x⟩ + Σ_{i=2}^{t} (B_{r_i}(x, x_{i−1}) + ⟨p_i, x⟩) )
= arg min_{x∈X} ( Σ_{i=1}^{t−1} ⟨g_i, x⟩ + r_1(x) + ⟨p_t, x⟩ + Σ_{i=1}^{t−1} (B_{r_{i+1}}(x, x_i) + ⟨p_i, x⟩) )
= {AdaOMD^X_R(f_{1:t−1})},

where in the last equation we used that the infimum in (5.11) is attained by the same point if we add δ(· | X) to it.

The above theorem states that the Adaptive OMD algorithm can be seen as the application of the Adaptive FTRL algorithm with a very interesting proximal regularizer strategy. At each round t, the regularizer increment of the AdaFTRL oracle is a Bregman divergence w.r.t. the t-th increment of the original mirror map strategy. Not only that, some special vectors from the normal cone of the set X (from which the player picks her points) crawl up into the FTRL formula. Intuitively, they skew the minimization formula of the original AdaFTRL oracle just enough for the iterates to match those of the AdaOMD oracle; this is the part that accounts for the choice of subgradient of the mirror map plus the indicator function which we discussed in our hypothetical version of AdaOMD.
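To make the correspondence tangible, here is a small numerical sanity check (an illustrative sketch under simplifying assumptions, not part of the original text): take E = R, X = [−1, 1], and the Euclidean mirror map with no increments (r_1(x) = x²/2, r_{t+1} ≡ 0), so that AdaOMD is projected online gradient descent. Computing the points p_t by the formula from the proof and plugging them into the FTRL objective (5.10) reproduces the OMD iterates exactly:

```python
# Illustrative sketch (not from the text): X = [-1, 1], r_1(x) = x^2/2,
# r_2 = r_3 = ... = 0, so AdaOMD is projected online gradient descent.

def proj(x):                       # Euclidean projection onto X = [-1, 1]
    return max(-1.0, min(1.0, x))

gs = [0.9, 0.8, -2.5, 0.3]         # arbitrary subgradients g_1, ..., g_T

# OMD iterates: x_1 = argmin_X r_1 = 0 and x_{t+1} = proj(x_t - g_t)
xs = [0.0]
for g in gs:
    xs.append(proj(xs[-1] - g))

# Normal-cone points from the proof: p_t = -(sum_{i<t} g_i + r_1'(x_t) + sum_{i<t} p_i),
# here with r_1'(x) = x and all later gradient increments zero.
ps = []
for t, x in enumerate(xs[:-1]):
    ps.append(-(sum(gs[:t]) + x + sum(ps)))

# FTRL objective (5.10): x_t minimizes sum_{i<t} g_i*x + x^2/2 + sum_{i<t} p_i*x
# over X, i.e. x_t = proj(-(sum_{i<t} g_i + sum_{i<t} p_i)).
for k, x in enumerate(xs):
    assert abs(proj(-(sum(gs[:k]) + sum(ps[:k]))) - x) < 1e-12
```

Note how the p_t are zero whenever the iterate is in the interior of X and become nonzero exactly when the projection clips an iterate to the boundary, which is where plain FTRL on the linearized losses would otherwise diverge from OMD.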

Moreover, note that the infimum (5.11) is unfair in the sense that it “looks into the future”: to decide the iterate x_{t+1} from round t + 1 it needs the point of the normal cone of X at x_{t+1}. Even though this formula is not practically implementable, it will help us derive regret bounds for AdaOMD from the tools we have developed in Chapter 4. Specifically, the following theorem applies the lemmas from Section 4.3 in a way very similar to the one we used to derive regret bounds for AdaFTRL in Section 4.4. Unfortunately, just blindly applying one of those theorems to the above FTRL regularizer strategies does not yield the regret bounds that we want. If we do so, the points in the normal cone crawl into the bound in undesired ways: either inside the dual norms together with the subgradients, or in the regret formula in not very desirable ways.

Finally, it is worth saying that this section is far from showing one of the simplest ways to derive regret bounds for the Adaptive OMD algorithm. The reason we are showing these connections (mainly based on the work of McMahan [48]) is two-fold. First, one of the main purposes of this text is to analyze the connections among many algorithms from Online Convex Optimization, an interesting fact which will be better discussed in Chapter 7. Second, some proofs of OMD regret bounds rely on potential functions (e.g., see [11]) or other smart tricks. This proof is, arguably, more “automatic” in the sense that we just apply what we already know, without major secrets and tricks.

Theorem 5.4.3. Let C := (X, F) be an OCO instance such that X is closed and such that each f ∈ F is a proper closed function which is subdifferentiable on X. Let R : Seq(F) → (−∞, +∞]^E be a mirror map strategy for C, let ENEMY be an enemy oracle for C, and let T ∈ N. Moreover, define

(x, f) := OCO_C(AdaOMD^X_R, ENEMY, T),

r_t := R(⟨f_1, . . . , f_{t−1}⟩) for each t ∈ {1, . . . , T + 1}, x_0 := x_1, and r_0 := r_1.

Finally, let g_t ∈ ∂f_t(x_t) be the same as in the definition of AdaOMD^X_R(f) on Algorithm 5.1 for each t ∈ [T]. If for every t ∈ {1, . . . , T + 1} there is σ_t ∈ R_{++} such that Σ_{i=1}^{t} r_i is σ_t-strongly convex w.r.t. a norm ‖·‖_{(t)} on X, then

Regret(AdaOMD^X_R, f, u) ≤ Σ_{t=1}^{T+1} B_{r_t}(u, x_{t−1}) + (1/2) Σ_{t=1}^{T} (1/σ_{t+1}) ‖g_t‖²_{(t+1),∗}, ∀u ∈ X.

Proof. Define h_t := ⟨g_t, ·⟩ for each t ∈ [T], define x_{T+1} := AdaOMD^X_R(f), and let u ∈ X. By⁸ Theorem 5.4.2, there are p_t ∈ N_X(x_t) for each t ∈ {1, . . . , T + 1} such that, if we define R′ : Seq(F) → (−∞, +∞]^E by R′(⟨⟩) := r_1 + ⟨p_1, ·⟩ + δ(· | X) and by

R′(f) := B_{r_{t+1}}(·, x_t) + ⟨p_{t+1}, ·⟩, ∀f ∈ F^t, ∀t ∈ [T],

then AdaOMD^X_R(f_{1:t−1}) = AdaFTRL_{R′}(h_{1:t−1}) for each t ∈ {1, . . . , T + 1}. Therefore, using the subgradient inequality, we have

Regret(AdaOMD^X_R, f, u) = Σ_{t=1}^{T} (f_t(x_t) − f_t(u)) ≤ Σ_{t=1}^{T} ⟨g_t, x_t − u⟩ = Σ_{t=1}^{T} (h_t(x_t) − h_t(u)) = Regret(AdaFTRL_{R′}, h, u).

Make the following definitions:

b_t := B_{r_t}(·, x_{t−1}) for each t ∈ {2, . . . , T + 1},  b_1 := r_1 + δ(· | X),

H_t := Σ_{i=1}^{t} h_i + Σ_{i=1}^{t+1} (b_i + ⟨p_i, ·⟩) for each t ∈ [T],  x_0 := x_1.

⁸We use Theorem 5.4.2 for a game with T + 1 rounds so that it yields the point p_{T+1} ∈ N_X(x_{T+1}) that we use in this proof.

By Lemma 4.3.1 we have

Regret(AdaFTRL_{R′}, h, u) ≤ Σ_{t=1}^{T+1} (b_t(u) − b_t(x_{t−1})) + Σ_{t=0}^{T} ⟨p_{t+1}, u − x_t⟩ + Σ_{t=1}^{T} (H_t(x_t) − H_t(x_{t+1}))
≤ Σ_{t=1}^{T+1} (b_t(u) − b_t(x_{t−1})) + Σ_{t=1}^{T} ⟨p_{t+1}, u − x_t⟩ + Σ_{t=1}^{T} (H_t(x_t) − H_t(x_{t+1})), (5.14)

where in the last inequality we have used that ⟨p_1, u − x_0⟩ = ⟨p_1, u − x_1⟩ ≤ 0 since p_1 ∈ N_X(x_1) and u ∈ X. Let us now show that

H_t(x_t) − H_t(x_{t+1}) ≤ (1/(2σ_{t+1})) ‖g_t‖²_{(t+1),∗} + ⟨p_{t+1}, x_t − x_{t+1}⟩, ∀t ∈ [T]. (5.15)

Let t ∈ [T]. Since dom h_t = E and since H_{t−1} + b_{t+1} is proper, we have that ri(dom(H_{t−1} + b_{t+1})) ∩ ri(dom h_t) is nonempty. Moreover, x_t ∈ arg min_{x∈E} (H_{t−1}(x) + b_{t+1}(x)) since x_t minimizes H_{t−1} by the definition of AdaFTRL and clearly minimizes b_{t+1} = B_{r_{t+1}}(·, x_t). Finally, since Σ_{i=1}^{t+1} r_i is σ_{t+1}-strongly convex (w.r.t. ‖·‖_{(t+1)}) on X, the function Σ_{i=1}^{t+1} b_i is also σ_{t+1}-strongly convex on X (see Lemma 3.11.2); but since dom(Σ_{i=1}^{t+1} b_i) ⊆ X, we have that Σ_{i=1}^{t+1} b_i is actually σ_{t+1}-strongly convex on E. This implies that H_{t−1} + b_{t+1} is also σ_{t+1}-strongly convex. Therefore, by Lemma 4.3.2 and since ∇h_t(x) = g_t for every x ∈ E, we have

H_t(x_t) − H_t(x_{t+1}) = H_{t−1}(x_t) + b_{t+1}(x_t) + h_t(x_t) − (H_{t−1}(x_{t+1}) + b_{t+1}(x_{t+1}) + h_t(x_{t+1})) + ⟨p_{t+1}, x_t − x_{t+1}⟩
≤ (1/(2σ_{t+1})) ‖g_t‖²_{(t+1),∗} + ⟨p_{t+1}, x_t − x_{t+1}⟩.

This ends the proof of (5.15). Putting together (5.14) and (5.15) yields

Regret(AdaFTRL_{R′}, h, u) ≤ Σ_{t=1}^{T+1} (b_t(u) − b_t(x_{t−1})) + Σ_{t=1}^{T} ⟨p_{t+1}, u − x_t⟩ + Σ_{t=1}^{T} (H_t(x_t) − H_t(x_{t+1}))
≤ Σ_{t=1}^{T+1} (b_t(u) − b_t(x_{t−1})) + Σ_{t=1}^{T} ⟨p_{t+1}, u − x_{t+1}⟩ + Σ_{t=1}^{T} (1/(2σ_{t+1})) ‖g_t‖²_{(t+1),∗}
≤ Σ_{t=1}^{T+1} (b_t(u) − b_t(x_{t−1})) + Σ_{t=1}^{T} (1/(2σ_{t+1})) ‖g_t‖²_{(t+1),∗}
= Σ_{t=1}^{T+1} B_{r_t}(u, x_{t−1}) + Σ_{t=1}^{T} (1/(2σ_{t+1})) ‖g_t‖²_{(t+1),∗},

where in the second inequality we have used that ⟨p_{t+1}, u − x_{t+1}⟩ ≤ 0 for every t ∈ [T] since p_{t+1} ∈ N_X(x_{t+1}) and u ∈ X, and in the last equation we have used the definition of b_t for t ∈ {1, . . . , T + 1} and the fact that r_1(u) − r_1(x_1) ≤ B_{r_1}(u, x_1) since ∇r_1(x_1) ∈ N_X(x_1) and, thus, ⟨∇r_1(x_1), u − x_1⟩ ≤ 0.

Recall that if R is a mirror map for a convex and closed set X and C := (X, F) is an OCO instance, then R : Seq(F) → (−∞, +∞]^E given by R(f) := [f = ⟨⟩] R is a mirror map strategy for C. Thus, an immediate corollary of the above theorem together with Theorem 3.8.4 is a regret bound for the EOMD oracle which matches the regret bound of the classic FTRL algorithm.

Corollary 5.4.4 (Derived from Theorem 5.4.3). Let C := (X, F) be an OCO instance such that X is closed and such that each f ∈ F is a proper closed function which is subdifferentiable on X. Let R : E → (−∞, +∞] be a mirror map for X, let ENEMY be an enemy oracle for C, and let T ∈ N. Moreover, define

(x, f) := OCO_C(EOMD^X_R, ENEMY, T).

Finally, let g_t ∈ ∂f_t(x_t) be as in the definition of EOMD^X_R(f) on Algorithm 5.2 for each t ∈ [T]. If σ ∈ R_{++} is such that R is σ-strongly convex w.r.t. a norm ‖·‖ on E, then

Regret(EOMD^X_R, f, u) ≤ B_R(u, x_1) + (1/(2σ)) Σ_{t=1}^{T} ‖g_t‖²_∗, ∀u ∈ X, (5.16)

where x_0 := x_1. In particular, if every function in F is ρ-Lipschitz continuous w.r.t. ‖·‖ on a convex set D ⊆ E such that X ⊆ int(D), there is θ ∈ R_{++} such that θ ≥ sup{B_R(x, y) : x ∈ X, y ∈ X ∩ int(dom R)}, and R′ := (ρ√T/√(2σθ)) R is also a mirror map for X, then

Regret(EOMD^X_{R′}, ENEMY, X) ≤ ρ √(2θT/σ).

Proof. Note that EOMD^X_R = AdaOMD^X_R, where R is given by R(f) := [f = ⟨⟩] R for every f ∈ Seq((−∞, +∞]^E). Moreover, since R is a mirror map for X, R is a mirror map strategy for C. Therefore, the first inequality is a direct application of Theorem 5.4.3 together with the fact that R is σ-strongly convex on X w.r.t. ‖·‖.

If each f ∈ F is ρ-Lipschitz continuous w.r.t. ‖·‖ on a convex set D such that X ⊆ int(D), then by Theorem 3.8.4 we have that ∂f(x) ⊆ {g ∈ E : ‖g‖_∗ ≤ ρ} for each f ∈ F and x ∈ X. Using this in (5.16) yields

Regret_T(EOMD^X_R, ENEMY, u) ≤ B_R(u, x_1) + Tρ²/(2σ).

Moreover, suppose there is θ ∈ R_{++} such that θ ≥ sup{R(x) − R(y) : x ∈ X, y ∈ X ∩ dom R}, and define

R′ := (ρ√T/√(2σθ)) R.

Note that R′ is (ρ√(σT)/√(2θ))-strongly convex on X. Finally, suppose R′ is also a mirror map. Then, plugging R′ into the above inequality yields, for every u ∈ X,

Regret_T(EOMD^X_{R′}, ENEMY, u) ≤ (ρ√T/√(2σθ)) B_R(u, x_1) + ρ√(θT)/√(2σ) ≤ ρ√(θT)/√(2σ) + ρ√(θT)/√(2σ) = ρ√(2θT/σ),

where in the second inequality we took the supremum over u ∈ X.
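As a closing illustration of Corollary 5.4.4 (a numerical sketch under assumptions not made in the text: linear losses, X the probability simplex, R the negative-entropy mirror map, which is 1-strongly convex w.r.t. ‖·‖₁ with θ = ln n, and ρ = 1 for ℓ∞-bounded subgradients), scaling the mirror map by ρ√T/√(2σθ) amounts to running exponentiated gradient with step size η = √(2σθ/T)/ρ, and the observed regret indeed stays below ρ√(2θT/σ):

```python
import math, random

# Illustrative sketch (assumptions not from the text): X = probability simplex,
# R = negative entropy (sigma = 1 w.r.t. the l1-norm, theta = ln n), linear
# losses with l_inf-bounded subgradients (rho = 1).  Scaling the mirror map by
# c = rho*sqrt(T)/sqrt(2*sigma*theta) equals using step size eta = 1/c.

random.seed(0)
n, T, rho, sigma = 5, 1000, 1.0, 1.0
theta = math.log(n)
eta = math.sqrt(2 * sigma * theta / T) / rho

x = [1.0 / n] * n                   # x_1 = argmin_X R = uniform distribution
alg_loss, cum = 0.0, [0.0] * n      # algorithm's loss and per-action cumulative losses
for _ in range(T):
    g = [random.uniform(-1.0, 1.0) for _ in range(n)]        # |g_i| <= rho
    alg_loss += sum(gi * xi for gi, xi in zip(g, x))
    cum = [c + gi for c, gi in zip(cum, g)]
    w = [xi * math.exp(-eta * gi) for xi, gi in zip(x, g)]   # entropic EOMD step
    s = sum(w)
    x = [wi / s for wi in w]

regret = alg_loss - min(cum)        # regret against the best fixed point of X
bound = rho * math.sqrt(2 * theta * T / sigma)
assert regret <= bound
```

The tuned bound here evaluates to √(2T ln n) ≈ 56.7 for these parameters, and the simulated regret sits well below it, as the corollary guarantees for any loss sequence satisfying the assumptions.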