
Fundamental Lemmas for Regret Bounds

Since $\partial f_t(x) = \{\nabla f_t(x)\}$ (the latter by Theorem 3.5.5) for any $x \in E$ and $t \in [T]$, optimality conditions (Theorem 3.6.2) yield, for each $t \in [T]$,

\[
x_t = -\eta \sum_{i=1}^{t-1} \nabla f_i(x_i) = -\eta \sum_{i=1}^{t-1} g_i \implies x_t = [t > 1](x_{t-1} - \eta g_{t-1}),
\]

where the implication follows by a simple induction since $x_1 = 0$ in this case. That is, in this unconstrained case Adaptive FTRL with squared $\ell_2$ regularization is exactly the well-known gradient descent algorithm! A reader familiar with gradient descent is probably wondering how to use time-varying values for $\eta$, the step size. For this case, define $\mathcal{R}$ by

\[
\mathcal{R}(h) := \left(\frac{1}{\eta_{t+1}} - [t > 0]\frac{1}{\eta_t}\right)\frac{1}{2}\lVert\cdot\rVert_2^2, \qquad \forall t \in \mathbb{N},\ h \in \mathcal{F}^t,
\]

where $\eta \colon \mathbb{N} \setminus \{0\} \to \mathbb{R}_{++}$. Then, following similar steps to the static $\ell_2$-norm case, we get the update rule $x_t = [t > 1](x_{t-1} - \eta_t g_{t-1})$. In Section 4.5 we will look at the general case with static regularizers, and in Section 4.6 we will look at time-varying step sizes. Moreover, the connections between FTRL and different variants of gradient descent will be further investigated in Chapter 5.
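To make this equivalence concrete, here is a minimal numerical sketch (ours, not from the text; the random vectors merely stand in for arbitrary subgradients $g_1, \dots, g_T$) checking that the closed-form FTRL iterate coincides with the gradient-descent recursion for a fixed step size:

```python
import numpy as np

# Sanity check: unconstrained FTRL with squared l2 regularization is gradient
# descent, i.e. x_t = -eta * sum_{i<t} g_i equals x_t = x_{t-1} - eta*g_{t-1}.
rng = np.random.default_rng(0)
eta, T, n = 0.1, 50, 3
gs = rng.normal(size=(T, n))   # stand-ins for the subgradients g_1, ..., g_T

x_gd = np.zeros(n)             # x_1 = 0
for t in range(2, T + 1):
    x_ftrl = -eta * gs[: t - 1].sum(axis=0)   # closed-form FTRL minimizer
    x_gd = x_gd - eta * gs[t - 2]             # gradient-descent recursion
    assert np.allclose(x_ftrl, x_gd)
```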

There are several other examples of regularizer strategies which we will look at throughout the remainder of the text. For example, in both of the previous examples we always used a squared norm as the regularizer at each round, changing at most the scaling factor. One option is to look at the squared distance from the previous iterate, that is, to use at round $t$ a regularizer increment of the type $x \in E \mapsto \lVert x - x_t \rVert^2$. We will look at these types of regularizers in Section 4.7. Another interesting option is, for each $t \in \mathbb{N} \setminus \{0\}$, to have a positive definite matrix $A_t \in \mathbb{R}^{n \times n}$ and to use at round $t$ a regularizer of the type $x \in \mathbb{R}^n \mapsto x^{\mathsf{T}} A_t x$. That is, at each round we use a different (squared) norm induced by some matrix. This is a path which we investigate in Chapter 6; both options are sketched below.
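The following small sketch (ours; the function names are illustrative, not from the text) just spells out these two kinds of regularizer increments:

```python
import numpy as np

# Hypothetical illustrations of the two regularizer increments previewed above.
def proximal_increment(x_t):
    """Round-t increment x |-> ||x - x_t||_2^2, the squared distance to the
    previous iterate (the type studied in Section 4.7)."""
    return lambda x: float((x - x_t) @ (x - x_t))

def matrix_increment(A_t):
    """Round-t increment x |-> x^T A_t x, a squared norm induced by a positive
    definite matrix A_t (the type studied in Chapter 6)."""
    return lambda x: float(x @ A_t @ x)

# Example: the increment induced by A_t = diag(1, 2) evaluated at x = (1, 1).
r = matrix_increment(np.diag([1.0, 2.0]))
assert r(np.array([1.0, 1.0])) == 3.0
```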

and set $x_0 := x_1$. If $H_t$ is proper and its infimum over $E$ is attained for every $t \in \{0, \dots, T\}$, then, for every $u \in E$,

\[
\mathrm{Regret}(\mathrm{AdaFTRL}_{\mathcal{R}}, f, u) \le \sum_{t=1}^{T+1} \bigl(r_t(u) - r_t(x_{t-1})\bigr) + \sum_{t=1}^{T} \bigl(H_t(x_t) - H_t(x_{t+1})\bigr). \tag{4.6}
\]

Proof. For each $t \in \{0, \dots, T\}$, define $h_t := r_{t+1} + [t > 0] f_t$. In this way, we have

\[
x_t \in \operatorname*{arg\,min}_{x \in E} H_{t-1}(x) = \operatorname*{arg\,min}_{x \in E} \sum_{i=0}^{t-1} h_i(x), \qquad \forall t \in \{1, \dots, T+1\}. \tag{4.7}
\]

Let us now bound the regret of the points $x_1, \dots, x_T$ with respect to the functions $h_1, \dots, h_T$ and to a comparison point $u \in E$ (plus a $-h_0(u)$ term):

\begin{align*}
\sum_{t=1}^{T} \bigl(h_t(x_t) - h_t(u)\bigr) - h_0(u)
&= \sum_{t=1}^{T} h_t(x_t) - H_T(u)
= \sum_{t=1}^{T} \bigl(H_t(x_t) - H_{t-1}(x_t)\bigr) - H_T(u) \\
&\overset{(4.7)}{\le} \sum_{t=1}^{T} \bigl(H_t(x_t) - H_{t-1}(x_t)\bigr) - H_T(x_{T+1}) \\
&= \sum_{t=1}^{T} \bigl(H_t(x_t) - H_t(x_{t+1})\bigr) - H_0(x_1),
\end{align*}

where in the last equation we just re-indexed the summation, placing $H_T(x_{T+1})$ inside the summation and leaving $H_0(x_1)$ out. Rearranging the terms and using $H_0 = h_0 = r_1$ and $x_0 = x_1$ yields

\begin{align*}
\sum_{t=1}^{T} \bigl(f_t(x_t) + r_{t+1}(x_t) - f_t(u) - r_{t+1}(u)\bigr)
&= \sum_{t=1}^{T} \bigl(h_t(x_t) - h_t(u)\bigr) \\
&\le r_1(u) - r_1(x_0) + \sum_{t=1}^{T} \bigl(H_t(x_t) - H_t(x_{t+1})\bigr),
\end{align*}

which implies

\[
\mathrm{Regret}(\mathrm{AdaFTRL}_{\mathcal{R}}, f, u) = \sum_{t=1}^{T} \bigl(f_t(x_t) - f_t(u)\bigr) \le \sum_{t=1}^{T+1} \bigl(r_t(u) - r_t(x_{t-1})\bigr) + \sum_{t=1}^{T} \bigl(H_t(x_t) - H_t(x_{t+1})\bigr).
\]

The above lemma has a quite straightforward proof, so much so that one may finish reading it with the feeling that we have not done much by proving it. Indeed, most of the proof boils down to rewriting the regret expression so that its terms are displayed in a more palatable way. Interestingly, the only inequality used in the whole proof is due to (4.7), which holds by the definition of the AdaFTRL algorithm.

The Strong FTRL Lemma bounds the regret of AdaFTRL by two sums. The first is usually bounded by some kind of per-round diameter, as measured by the regularizer, of the set $X \subseteq E$ where the player is making her predictions (assuming $u \in X$ as well). This already shows that the choice of regularizer will be heavily influenced by the set $X$. The second sum translates the intuition we talked about at the beginning of the chapter: it measures the stability of consecutive iterates. The player then has to balance two competing factors. On the one hand, she wants to minimize the raw values of the functions $H_t$, by the definition of the AdaFTRL oracle. On the other hand, her choices from one round to another should not change too abruptly, so that the terms of the form $H_t(x_t) - H_t(x_{t+1})$ in the bound do not become too large.
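Before moving on, here is a toy numerical sketch (ours, not from the text; all names are illustrative) of the bound (4.6) in the static-regularizer case: linear losses $f_t(x) = \langle g_t, x \rangle$ with $r_1 = \frac{1}{2\eta}\lVert\cdot\rVert_2^2$ and $r_t = 0$ for $t \ge 2$, so that the AdaFTRL iterates have the closed form $x_{t+1} = -\eta \sum_{i \le t} g_i$:

```python
import numpy as np

# Toy check of the Strong FTRL Lemma bound (4.6); instance and names are ours.
# Losses: f_t(x) = <g_t, x>. Static regularizer: r_1(x) = ||x||^2 / (2*eta),
# r_t = 0 for t >= 2, hence H_t(x) = r_1(x) + sum_{i<=t} <g_i, x> for t >= 1.
rng = np.random.default_rng(1)
eta, T, n = 0.05, 100, 4
gs = rng.normal(size=(T, n))
u = rng.normal(size=n)                      # an arbitrary comparison point

r1 = lambda x: x @ x / (2 * eta)
H = lambda t, x: r1(x) + gs[:t].sum(axis=0) @ x

# x_1 = argmin r_1 = 0 and x_{t+1} = argmin H_t = -eta * sum_{i<=t} g_i; x_0 := x_1.
xs = [np.zeros(n)] + [-eta * gs[:t].sum(axis=0) for t in range(1, T + 1)]

regret = sum(gs[t] @ (xs[t] - u) for t in range(T))
stability = sum(H(t, xs[t - 1]) - H(t, xs[t]) for t in range(1, T + 1))
bound = (r1(u) - r1(xs[0])) + stability     # first sum of (4.6) is r_1(u) - r_1(x_0)
assert regret <= bound + 1e-6
print(f"regret = {regret:.3f} <= bound = {bound:.3f}")
```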

As one may have noticed, consecutive iterates of the AdaFTRL algorithm are minimizers of functions which are, in some sense, similar. More explicitly, let $\mathcal{F} \subseteq (-\infty, +\infty]^E$ be nonempty, let $f \in \mathcal{F}^t$ for some $t \in \mathbb{N} \setminus \{0\}$, and let $\mathcal{R}$ be any appropriate regularizer strategy for $\mathcal{F}$. Define the consecutive iterates $x_t := \mathrm{AdaFTRL}_{\mathcal{R}}(\langle f_1, \dots, f_{t-1} \rangle)$ and $x_{t+1} := \mathrm{AdaFTRL}_{\mathcal{R}}(\langle f_1, \dots, f_t \rangle)$. Note that $x_t$ minimizes $H := \sum_{i=1}^{t} r_i + \sum_{i=1}^{t-1} f_i$ and $x_{t+1}$ minimizes $H + f_t + r_{t+1}$, where $r_{i+1} := \mathcal{R}(\langle f_1, \dots, f_i \rangle)$ for each $i \in \{0, \dots, t\}$. Looking from this perspective, one may wonder whether we can say something about the distance between $x_t$ and $x_{t+1}$ (or the difference between the values of $H + f_t + r_{t+1}$ at these points). For example, in the case where $f_t + r_{t+1}$ does not vary much throughout $E$, we may guess that $x_t$ and $x_{t+1}$, and the values of $H + f_t + r_{t+1}$ at these points, are close. Unfortunately, if we allow the functions from $f$ and the regularizers delivered by $\mathcal{R}$ to be arbitrary, we cannot guarantee much: for instance, adding a tiny linear tilt to a function with two distant near-tied minimizers can move the arg min all the way from one to the other.

This is a point where convexity starts to play a major role in the analysis of AdaFTRL. The next lemma shows that if $H + f_t$ is strongly convex², then the distance between $x_t$ and $x_{t+1}$ is bounded by the dual norm of the subgradients of $f_t$ at $x_t$. Even though a bound which depends on the dual norm of the subgradient may seem to lack any intuitive meaning at first, recall that such a quantity is deeply connected with the Lipschitz continuity constant of $f_t$ (see Theorem 3.8.4). Thus, the following lemma tells us that if $f_t$ is $\rho$-Lipschitz continuous for small $\rho$ and $H$ is strongly convex (for example), then adding $f_t$ to $H$ does not move the minimizers much. It is worth noting that, when we apply this result, it is not always the case that the function $F$ from the statement is of the same form as the function $H$ from our current discussion (though usually $F$ will be only slightly different from $H$). As a matter of fact, the different functions that we plug into $F$ in different applications of the lemma when bounding terms from the Strong FTRL Lemma yield similar bounds, but with important "off-by-one" differences which will be discussed in the next section.

Lemma 4.3.2 ([48, Lemma 7]). Let $F, f \colon E \to (-\infty, +\infty]$ be closed proper convex functions such that $F + f$ is $\sigma$-strongly convex with respect to a norm $\lVert\cdot\rVert$ on $E$ and such that $\inf_{x \in E} F(x)$ is attained, and let $\bar{x} \in \operatorname*{arg\,min}_{x \in E} F(x)$. If $\mathrm{ri}(\mathrm{dom}\, F) \cap \mathrm{ri}(\mathrm{dom}\, f)$ is nonempty, then $\inf_{x \in E}(F(x) + f(x))$ is attained and, for any $g \in \partial f(\bar{x})$,

\[
\lVert \bar{x} - \bar{y} \rVert \le \frac{1}{\sigma} \lVert g \rVert_* \qquad \forall \bar{y} \in \operatorname*{arg\,min}_{x \in E} (F(x) + f(x))
\]

and

\[
F(\bar{x}) + f(\bar{x}) - (F(u) + f(u)) \le \frac{1}{2\sigma} \lVert g \rVert_*^2, \qquad \forall u \in E.
\]

Proof. Let $g \in \partial f(\bar{x})$ and define $\varphi := F + f - \langle g, \cdot \rangle$. Since $F + f$ is $\sigma$-strongly convex w.r.t. $\lVert\cdot\rVert$ and $-\langle g, \cdot \rangle$ is convex, we have that $\varphi$ is also $\sigma$-strongly convex w.r.t. $\lVert\cdot\rVert$. Thus, by the strong convexity/smoothness duality (Theorem 3.10.2), we have that

\[
\varphi^* \text{ is } \tfrac{1}{\sigma}\text{-strongly smooth with respect to } \lVert\cdot\rVert_*. \tag{4.8}
\]

By Theorem 3.2.7, the sum of closed convex functions is itself a closed function. Thus, $F + f$ is closed, and since $F + f$ is strongly convex, by Lemma 3.9.14 there is $\bar{y} \in \operatorname*{arg\,min}_{x \in E} (F(x) + f(x))$.

²We do not add $r_{t+1}$ here since we want to use bounds which depend on the subgradients of $f_t$, not of $f_t + r_{t+1}$. How we deal with this extra $r_{t+1}$ term will become clear when we apply Lemma 4.3.2 to derive regret bounds in the next section. One example of a case where it is easy to deal with this term is the classical FTRL case, where we use a static regularizer (i.e., one that does not change throughout the game). In this case, $\mathcal{R}(f) = 0$ for any nonempty $f \in \mathrm{Seq}(\mathcal{F})$. Thus, $H + f_t + r_{t+1} = H + f_t$ for any $t \in [T]$ since $r_t = 0$ for $t \in \{2, \dots, T+1\}$ in this case.

Assume for now that

\[
\bar{x} = \nabla \varphi^*(0) \quad \text{and} \quad \bar{y} = \nabla \varphi^*(-g). \tag{4.9}
\]

We will prove the above claim later. With that, since $\varphi^*$ is $(1/\sigma)$-strongly smooth, by the definition of strong smoothness and since $\lVert\cdot\rVert_{**} = \lVert\cdot\rVert$ by Theorem 3.8.2, we have

\[
\lVert \bar{x} - \bar{y} \rVert \overset{(4.9)}{=} \lVert \nabla \varphi^*(0) - \nabla \varphi^*(-g) \rVert \le \frac{1}{\sigma} \lVert g \rVert_*.
\]

To prove the second inequality from the statement, note that by Theorem 3.5.2 (items (iv) and (v)) together with (4.9), we have

\[
\langle 0, \bar{x} \rangle \overset{(4.9)}{=} \langle 0, \nabla \varphi^*(0) \rangle \overset{\text{Thm.~3.5.2}}{=} \varphi^*(0) + \varphi(\nabla \varphi^*(0)) \overset{(4.9)}{=} \varphi^*(0) + \varphi(\bar{x})
\implies F(\bar{x}) + f(\bar{x}) - \langle g, \bar{x} \rangle = \varphi(\bar{x}) = -\varphi^*(0) \tag{4.10}
\]

and

\[
\langle -g, \bar{y} \rangle \overset{(4.9)}{=} \langle -g, \nabla \varphi^*(-g) \rangle \overset{\text{Thm.~3.5.2}}{=} \varphi^*(-g) + \varphi(\nabla \varphi^*(-g)) \overset{(4.9)}{=} \varphi^*(-g) + \varphi(\bar{y})
\implies F(\bar{y}) + f(\bar{y}) = \varphi(\bar{y}) + \langle g, \bar{y} \rangle = -\varphi^*(-g). \tag{4.11}
\]

Moreover, (4.8) together with Lemma 3.10.1 implies

\[
\varphi^*(y) \le \varphi^*(x) + \langle y - x, \nabla \varphi^*(x) \rangle + \frac{1}{2\sigma} \lVert y - x \rVert_*^2, \qquad \forall x, y \in E. \tag{4.12}
\]

Therefore, for every $u \in E$,

\begin{align*}
F(\bar{x}) &+ f(\bar{x}) - \langle g, \bar{x} \rangle - (F(u) + f(u)) \\
&\le F(\bar{x}) + f(\bar{x}) - \langle g, \bar{x} \rangle - (F(\bar{y}) + f(\bar{y})) && \text{since } \bar{y} \in \operatorname*{arg\,min}_{x \in E} (F(x) + f(x)) \\
&= -\varphi^*(0) + \varphi^*(-g) && \text{by (4.10) and (4.11)} \\
&\le \langle -g, \nabla \varphi^*(0) \rangle + \frac{1}{2\sigma} \lVert g \rVert_*^2 && \text{by (4.12)} \\
&= -\langle g, \bar{x} \rangle + \frac{1}{2\sigma} \lVert g \rVert_*^2 && \text{by (4.9)}.
\end{align*}

Since the $\langle g, \bar{x} \rangle$ terms above cancel out, this yields the second inequality from the statement.

Finally, it only remains to prove (4.9). Since $\mathrm{ri}(\mathrm{dom}\, F) \cap \mathrm{ri}(\mathrm{dom}\, f)$ is nonempty, by Theorem 3.5.4 we have

\[
\partial \varphi(x) = \partial F(x) + \partial f(x) - g, \qquad \forall x \in E. \tag{4.13}
\]

Since $\bar{x}$ minimizes $F$, we have $0 \in \partial F(\bar{x})$ by the definition of subgradient. Thus, $g \in \partial f(\bar{x})$ together with (4.13) implies $0 \in \partial \varphi(\bar{x})$. Since $F$, $f$, and $\langle g, \cdot \rangle$ are closed, we have that $\varphi$ is closed as well by Theorem 3.2.7. Thus, by Theorem 3.5.2, we have $\bar{x} \in \partial \varphi^*(0)$.

Similarly, since $\bar{y}$ minimizes $F + f$, we have $0 \in \partial(F + f)(\bar{y}) = \partial F(\bar{y}) + \partial f(\bar{y})$, where the equality holds by Theorem 3.5.4. This with (4.13) yields $-g \in \partial \varphi(\bar{y})$, and again by Theorem 3.5.2 we have $\bar{y} \in \partial \varphi^*(-g)$ since $\varphi$ is closed by Theorem 3.2.7.

To complete the proof of (4.9), note that since $\varphi^*$ is strongly smooth, it is differentiable by the definition of strong smoothness. Therefore, by Theorem 3.5.5 we have $\partial \varphi^*(x) = \{\nabla \varphi^*(x)\}$ for every $x \in E$, which completes the proof of (4.9).
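As a quick sanity check, the following toy sketch (ours, not from the text; the instance and names are illustrative) instantiates the lemma with $F = \frac{\sigma}{2}\lVert\cdot\rVert_2^2$ and a linear $f = \langle g, \cdot \rangle$, for which $\bar{x}$, $\bar{y}$, and the only subgradient $g$ are explicit and the $\ell_2$ norm is its own dual:

```python
import numpy as np

# Toy instance of Lemma 4.3.2 (ours): F(x) = (sigma/2)||x||^2, f(x) = <g, x>.
# Then F + f is sigma-strongly convex w.r.t. ||.||_2, xbar = argmin F = 0,
# the subdifferential of f at xbar is {g}, and ybar = argmin (F + f) = -g/sigma.
rng = np.random.default_rng(2)
sigma, n = 2.0, 5
g = rng.normal(size=n)

F = lambda x: sigma / 2 * (x @ x)
f = lambda x: g @ x
xbar = np.zeros(n)
ybar = -g / sigma

# First inequality: ||xbar - ybar|| <= (1/sigma) ||g|| (tight on this instance).
assert np.linalg.norm(xbar - ybar) <= np.linalg.norm(g) / sigma + 1e-12

# Second inequality: F(xbar) + f(xbar) - (F(u) + f(u)) <= ||g||^2 / (2*sigma),
# tested at random points u (it is tight at u = ybar).
for u in rng.normal(size=(100, n)):
    assert F(xbar) + f(xbar) - (F(u) + f(u)) <= (g @ g) / (2 * sigma) + 1e-12
```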

We stress here that the condition on the relative interiors of the domains of the functions in the above lemma is fundamental for its proof, but normally one need not worry (much) about it. As we are going to see in the next section, we will need this condition to be satisfied, for every $t \in \mathbb{N}$, by the domains of functions whose form³ is usually the sum of the enemy's choices and the regularizer increments up to round $t$, together with the domain of the regularizer increment $r_{t+1}$ of round $t+1$. In many applications, such as the problems described in Section 2.5, the functions played by the enemy all have the same domain (usually $E$), and the player has control over the regularizers.

Therefore, we will hardly meet an OCO instance in the next sections and chapters where such conditions are not satisfied. Still, every time we need this type of condition on the domains of the functions for some result, we clearly describe it in the statement of the result. In this way, every condition needed for the results to hold, even if met by most or all of the problems we see in this text, is always clearly stated. Thus, one may keep in mind that the sole purpose of most of the stated conditions about the intersection of the relative interiors of function domains in the results of this chapter is to enable us to apply the above lemma.