
Regret Bounds for the Adaptive FTRL Algorithm

We stress here that the condition on the relative interiors of the domains of the functions in the above lemma is fundamental for its proof, but one normally need not worry (much) about it. As we are going to see in the next section, we will need this condition to be satisfied, for every t ∈ ℕ, by the domain of a function which is usually the sum of the enemy's choices and the regularizer increments up to round t, together with the domain of the regularizer increment r_{t+1} of round t + 1. In many applications, such as the problems described in Section 2.5, the functions played by the enemy all have the same domain (usually E), and the player has control over the regularizers.

Therefore, we will hardly meet an OCO instance in the next sections and chapters where such conditions are not satisfied. Still, every time we need this type of condition on the domains of the functions for some result, we clearly describe it in the statement of the result. In this way, every condition needed for the results to hold, even if met by most or all of the problems we see in this text, is always clearly stated. Thus, one may keep in mind that the sole purpose of most of the stated conditions about the intersection of the relative interiors of function domains in the results of this chapter is to enable us to apply the above lemma.

The regret bounds we are about to prove, however, need slightly different assumptions in each case. These assumptions are encapsulated in the concept of a strong regularizer strategy.

Definition 4.4.2 (σ-strong and σ-proximally strong FTRL regularizer strategies). Let T ∈ ℕ, let σ ∈ ℝ^T_{++}, let C := (X, F) be an OCO instance, let f ∈ F^T, and let R be an FTRL regularizer strategy for C. Define

\[
r_t := R(\langle f_1, \dots, f_{t-1} \rangle) \ \text{ for each } t \in \{1, \dots, T+1\}, \quad \text{and} \quad H_t := \sum_{i=1}^{t+1} r_i + \sum_{i=1}^{t} f_i \ \text{ for each } t \in \{0, \dots, T\}.
\]

We say that R is σ-strong for f (with respect to norms ‖·‖_(1), ‖·‖_(2), . . . , ‖·‖_(T) on E) if

(i) H_{t−1} + f_t is σ_t-strongly convex⁴ w.r.t. ‖·‖_(t) for each t ∈ [T], and

(ii) ri(dom H_{t−1}) ∩ ri(dom f_t) is nonempty for each t ∈ [T].

If ‖·‖ := ‖·‖_(1) = ‖·‖_(2) = · · · = ‖·‖_(T), we may say that R is σ-strong w.r.t. ‖·‖. Likewise, we say that R is σ-proximally strong for f (with respect to norms ‖·‖_(1), ‖·‖_(2), . . . , ‖·‖_(T) on E) if

(i) H_{t−1} is σ_t-strongly convex w.r.t. ‖·‖_(t) for each t ∈ [T],

(ii) ri(dom(H_{t−1} + r_{t+1})) ∩ ri(dom f_t) is nonempty for each t ∈ [T], and

(iii) R is proximal.

If ‖·‖ := ‖·‖_(1) = ‖·‖_(2) = · · · = ‖·‖_(T), we may say that R is σ-proximally strong w.r.t. ‖·‖.

The second condition in each of the above definitions is a technical assumption needed to apply Lemma 4.3.2, and it is usually easily satisfied. Even though it is important to know that such a condition on the effective domains is needed, in the most common OCO instances it holds without effort. Usually, the functions in the set F ⊆ (−∞,+∞]^E from which the enemy picks his functions all have the same domain. In this way, it is not hard to design regularizers which satisfy condition (ii) of either definition. We shall see some examples soon. The first condition in each of these definitions deals with the strong convexity parameters σ of the functions, which will play a major role in the following regret bounds. Moreover, it is worth saying that even though the strong convexity assumption is on the “cumulative functions” H_t, since the functions played by the enemy are convex, σ will usually be determined single-handedly by the regularizer strategy.

For example, consider the OCO instance C := (ℝ^d, F), where F is a set of closed proper convex functions on ℝ^d with effective domain equal to ℝ^d. Recall from Lemma 3.9.5 that the function (1/2)‖·‖₂² is 1-strongly convex w.r.t. ‖·‖₂. Thus, for any T ∈ ℕ and f ∈ F^T, the function (1/2)‖·‖₂² + ∑_{t=1}^T f_t is also 1-strongly convex w.r.t. ‖·‖₂, since adding convex functions to a strongly convex function preserves strong convexity. Finally, since dom f = ℝ^d for each f ∈ F, we conclude that the FTRL regularizer strategy given by R(f) := [f = ⟨⟩](1/2)‖·‖₂² for each f ∈ Seq(F) is 1-strong for any function sequence in Seq(F), where 1 is a properly sized sequence with all entries equal to 1. One example of a proximally strong FTRL regularizer strategy is one whose regularizer increment on round t ∈ ℕ is of the type x ↦ ‖x − [t > 1]x_{t−1}‖, where x_{t−1} is the iterate from the previous round. We will look at this type of regularizer in detail in Section 4.7. Let us now prove general regret bounds for the Adaptive FTRL oracle with strong and proximally strong FTRL regularizer strategies. Again, we note that the following proofs rely mainly on the lemmas from the previous section.
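To make this example concrete, the following minimal sketch (ours, not from the text) implements the AdaFTRL oracle with this regularizer strategy: the only nonzero regularizer increment is r_1 = (1/2)‖·‖₂², and each iterate minimizes the sum of the increments and the losses seen so far. The quadratic losses, the dimension, and the use of scipy.optimize.minimize as a generic solver are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

d, T = 2, 5
rng = np.random.default_rng(0)

# Stand-ins for the enemy's choices: closed proper convex functions on R^d
# (here, quadratics f_t(x) = ||x - z_t||_2^2 with hypothetical targets z_t).
targets = [rng.normal(size=d) for _ in range(T)]
fs = [lambda x, z=z: float(np.sum((x - z) ** 2)) for z in targets]

def regularizer_increment(past):
    """R(<f_1, ..., f_{t-1}>): (1/2)||.||_2^2 on round 1, zero afterwards."""
    if len(past) == 0:
        return lambda x: 0.5 * float(x @ x)
    return lambda x: 0.0

def ada_ftrl(past):
    """x_t minimizes H_{t-1}: the regularizer increments plus past losses."""
    rs = [regularizer_increment(past[:i]) for i in range(len(past) + 1)]
    obj = lambda x: sum(r(x) for r in rs) + sum(f(x) for f in past)
    return minimize(obj, np.zeros(d)).x

xs = [ada_ftrl(fs[:t]) for t in range(T)]  # iterates x_1, ..., x_T
print(np.round(np.array(xs), 3))
```

Since each objective here is 1-strongly convex, the minimizers exist and are unique, matching the attainment requirement used in the proofs below.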

⁴Note that, for each t ∈ [T], the norm ‖·‖_(t) may be influenced by the regularizer increments r_1, . . . , r_t, i.e., the ones chosen up to round t.

Theorem 4.4.3 (General AdaFTRL Regret Bound). Let C := (X, F) be an OCO instance such that each f ∈ F is proper and closed. Let R: Seq(F) → (−∞,+∞]^E be an FTRL regularizer strategy, let T ∈ ℕ, and let ENEMY be an enemy oracle for C. Moreover, define

(x, f) := OCO_C(AdaFTRL_R, ENEMY, T),

r_t := R(⟨f_1, . . . , f_{t−1}⟩) for each t ∈ {1, . . . , T + 1}.

Finally, suppose there exists⁵ g_t ∈ ∂f_t(x_t) for each t ∈ [T]. If σ ∈ ℝ^T_{++} and R is σ-strong for f w.r.t. norms ‖·‖_(1), . . . , ‖·‖_(T) on E, then x ∈ Seq(X) and

\[
\mathrm{Regret}(\mathrm{AdaFTRL}_R, f, u) \le \sum_{t=1}^{T} \bigl(r_t(u) - r_t(x_t)\bigr) + \frac{1}{2} \sum_{t=1}^{T} \frac{1}{\sigma_t} \|g_t\|_{(t),*}^2.
\]

Proof. Define

\[
H_t := \sum_{i=1}^{t+1} r_i + \sum_{i=1}^{t} f_i \quad \text{for each } t \in \{0, \dots, T\}.
\]

First of all, since inf_{x∈E} H_t(x) = inf_{x∈E} (∑_{i=1}^{t+1} r_i(x) + ∑_{i=1}^{t} f_i(x)) is attained for every t ∈ {0, . . . , T}, we have that AdaFTRL_R(⟨f_1, . . . , f_t⟩) is properly defined for each t ∈ {0, . . . , T}. Moreover, since dom r_1 ⊆ X, we have x ∈ Seq(X) by the definition of AdaFTRL_R.

Define x_0 := x_1 and x_{T+1} := AdaFTRL_R(⟨f_1, . . . , f_T⟩). By the Strong FTRL Lemma (Lemma 4.3.1), we have

\[
\begin{aligned}
\mathrm{Regret}(\mathrm{AdaFTRL}_R, f, u)
&\le \sum_{t=1}^{T+1} \bigl(r_t(u) - r_t(x_{t-1})\bigr) + \sum_{t=1}^{T} \bigl(H_t(x_t) - H_t(x_{t+1})\bigr) \\
&= -r_1(x_0) + \sum_{t=1}^{T+1} r_t(u) + \sum_{t=1}^{T} \bigl(H_t(x_t) - H_t(x_{t+1}) - r_{t+1}(x_t)\bigr) \\
&= -r_1(x_1) + \sum_{t=0}^{T} r_{t+1}(u) + \sum_{t=1}^{T} \bigl(H_t(x_t) - H_t(x_{t+1}) - r_{t+1}(x_t)\bigr).
\end{aligned}
\tag{4.14}
\]

Let t ∈ [T]. By assumption, ri(dom H_{t−1}) ∩ ri(dom f_t) is nonempty and H_{t−1} + f_t is σ_t-strongly convex w.r.t. ‖·‖_(t). Thus, since x_t ∈ arg min_{x∈E} H_{t−1}(x), by Lemma 4.3.2 with F := H_{t−1} (which is closed since the sum of closed functions is closed by Theorem 3.2.7) and f := f_t we have

\[
\begin{aligned}
H_t(x_t) - H_t(x_{t+1}) - r_{t+1}(x_t)
&= H_{t-1}(x_t) + f_t(x_t) + r_{t+1}(x_t) - H_{t-1}(x_{t+1}) - f_t(x_{t+1}) - r_{t+1}(x_{t+1}) - r_{t+1}(x_t) \\
&\le \frac{1}{2\sigma_t} \|g_t\|_{(t),*}^2 - r_{t+1}(x_{t+1}).
\end{aligned}
\]

Plugging the above inequality for every t ∈ [T] into (4.14) yields

\[
\begin{aligned}
\mathrm{Regret}(\mathrm{AdaFTRL}_R, f, u)
&\le \sum_{t=0}^{T} \bigl(r_{t+1}(u) - r_{t+1}(x_{t+1})\bigr) + \frac{1}{2} \sum_{t=1}^{T} \frac{1}{\sigma_t} \|g_t\|_{(t),*}^2 \\
&= \sum_{t=1}^{T+1} \bigl(r_t(u) - r_t(x_t)\bigr) + \frac{1}{2} \sum_{t=1}^{T} \frac{1}{\sigma_t} \|g_t\|_{(t),*}^2.
\end{aligned}
\tag{4.15}
\]

⁵From Theorem 3.5.1, we know that a convex function is always subdifferentiable on the relative interior of its domain. Thus, it is usually hard to find a case where the functions played by the enemy are not subdifferentiable at the iterates of the player.

We are almost done: there is still an extra term in the first summation when compared to the bound in the statement. Note, however, that if we set r_{T+1} := 0, then the iterates delivered by the AdaFTRL_R oracle over the subsequences of f would still be x_1, . . . , x_T. We can make this modification formal by defining R′ by R′(f) := 0 and by making it equal to R on Seq(F) \ {f}. In this way, we have x_t = AdaFTRL_{R′}(⟨f_1, . . . , f_{t−1}⟩) for each t ∈ [T], as argued. Therefore,

\[
\mathrm{Regret}(\mathrm{AdaFTRL}_R, f, u) = \mathrm{Regret}(\mathrm{AdaFTRL}_{R'}, f, u) \overset{(4.15)}{\le} \sum_{t=1}^{T} \bigl(r_t(u) - r_t(x_t)\bigr) + \frac{1}{2} \sum_{t=1}^{T} \frac{1}{\sigma_t} \|g_t\|_{(t),*}^2.
\]
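Before moving to the proximal case, a quick numerical sanity check of Theorem 4.4.3 may be helpful (ours, not from the text). For linear losses f_t(x) = ⟨g_t, x⟩ and the regularizer strategy of the earlier example, the iterates have the closed form x_t = −∑_{i<t} g_i, and the theorem applies with σ_t = 1 and the ℓ₂ norm; the Gaussian data and the comparator u below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
d, T = 3, 200
G = rng.normal(size=(T, d))  # g_t = (sub)gradient of f_t(x) = <g_t, x>

# FTRL with r_1 = (1/2)||.||_2^2 and r_t = 0 for t > 1: each iterate has the
# closed form x_t = argmin_x (1/2)||x||_2^2 + sum_{i<t} <g_i, x> = -sum_{i<t} g_i.
X = np.vstack([np.zeros(d), -np.cumsum(G[:-1], axis=0)])

u = rng.normal(size=d)  # an arbitrary comparator point

regret = float(np.sum(G * (X - u)))                        # sum_t <g_t, x_t - u>
bound = 0.5 * float(u @ u) + 0.5 * float(np.sum(G ** 2))   # bound with sigma_t = 1

print(f"regret = {regret:.2f}  <=  bound = {bound:.2f}")
assert regret <= bound
```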

Theorem 4.4.4 (Proximal AdaFTRL Regret Bound). Let C := (X, F) be an OCO instance such that each f ∈ F is proper and closed. Let R: Seq(F) → (−∞,+∞]^E be a proximal FTRL regularizer strategy, let T ∈ ℕ, and let ENEMY be an enemy oracle for C. Moreover, define

(x, f) := OCO_C(AdaFTRL_R, ENEMY, T),

r_t := R(⟨f_1, . . . , f_{t−1}⟩) for each t ∈ {1, . . . , T + 1}.

Finally, suppose there exists g_t ∈ ∂f_t(x_t) for each t ∈ [T]. If σ ∈ ℝ^T_{++} and R is σ-proximally strong for f w.r.t. norms ‖·‖_(1), . . . , ‖·‖_(T) on E, then x ∈ Seq(X) and

\[
\mathrm{Regret}(\mathrm{AdaFTRL}_R, f, u) \le \sum_{t=0}^{T} \bigl(r_{t+1}(u) - r_{t+1}(x_t)\bigr) + \frac{1}{2} \sum_{t=1}^{T} \frac{1}{\sigma_{t+1}} \|g_t\|_{(t+1),*}^2.
\]

Proof. Define

\[
H_t := \sum_{i=1}^{t+1} r_i + \sum_{i=1}^{t} f_i \quad \text{for each } t \in \{0, \dots, T\}.
\]

First of all, since inf_{x∈E} H_t(x) = inf_{x∈E} (∑_{i=1}^{t+1} r_i(x) + ∑_{i=1}^{t} f_i(x)) is attained for every t ∈ {0, . . . , T}, we have that AdaFTRL_R(⟨f_1, . . . , f_t⟩) is properly defined for each t ∈ {0, . . . , T}. Moreover, since dom r_1 ⊆ X, we have x ∈ Seq(X) by the definition of AdaFTRL_R.

Define x_0 := x_1 and x_{T+1} := AdaFTRL_R(f). By the Strong FTRL Lemma (Lemma 4.3.1), we have

\[
\mathrm{Regret}(\mathrm{AdaFTRL}_R, f, u) \le \sum_{t=0}^{T} \bigl(r_{t+1}(u) - r_{t+1}(x_t)\bigr) + \sum_{t=1}^{T} \bigl(H_t(x_t) - H_t(x_{t+1})\bigr).
\tag{4.16}
\]

Let⁶ t ∈ [T]. By assumption, ri(dom(H_{t−1} + r_{t+1})) ∩ ri(dom f_t) is nonempty and H_t = H_{t−1} + r_{t+1} + f_t is σ_{t+1}-strongly convex w.r.t. ‖·‖_{(t+1)}. Moreover, we have x_t ∈ arg min_{x∈E} H_{t−1}(x) and x_t ∈ arg min_{x∈E} r_{t+1}(x) (recall that R is proximal). Thus, x_t ∈ arg min_{x∈E} (H_{t−1}(x) + r_{t+1}(x)). Finally, we can apply Lemma 4.3.2 with F := H_{t−1} + r_{t+1} (which is closed since the sum of closed functions is closed by Theorem 3.2.7) and f := f_t, which yields

\[
H_t(x_t) - H_t(x_{t+1}) = F(x_t) + f_t(x_t) - F(x_{t+1}) - f_t(x_{t+1}) \le \frac{1}{2\sigma_{t+1}} \|g_t\|_{(t+1),*}^2 \quad \text{for all } g_t \in \partial f_t(x_t).
\]

Plugging the above inequality for every t ∈ [T] into (4.16) yields the bound from the statement.

⁶Up to this point, the proof is identical to the one from Theorem 4.4.3. The main differences appear from now on, when we use Lemma 4.3.2.

Let f_1, . . . , f_T ∈ F for some F ⊆ (−∞,+∞]^E, and let R: Seq(F) → (−∞,+∞]^E. When applying AdaFTRL_R, we usually choose regularizer functions which are strongly convex. That is, we choose R such that the sum of regularizer increments ∑_{i=1}^{t} r_i is strongly convex for each t ∈ [T], where r_i := R(⟨f_1, . . . , f_{i−1}⟩) for each i ∈ [T]. However, in the regret bounds stated above, we make assumptions on the strong convexity of the functions H_t := ∑_{i=1}^{t+1} r_i + ∑_{i=1}^{t} f_i. The reason for that is to capture the case where the functions f_t themselves are strongly convex, sometimes making AdaFTRL attain low-regret guarantees without any regularization at all.
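As a hedged illustration of this last point (a standard calculation, not from the text), suppose every f_t is μ-strongly convex w.r.t. a fixed norm ‖·‖, take the trivial regularizer strategy R ≡ 0, and assume that the minimizers defining the iterates exist and that the domain condition of Lemma 4.3.2 holds. Then H_{t−1} + f_t = ∑_{i=1}^{t} f_i is (tμ)-strongly convex, so Theorem 4.4.3 applies with σ_t = tμ and gives

\[
\mathrm{Regret}(\mathrm{AdaFTRL}_R, f, u) \le \frac{1}{2\mu} \sum_{t=1}^{T} \frac{\|g_t\|_*^2}{t} \le \frac{\rho^2}{2\mu} (1 + \ln T),
\]

where the last inequality assumes ‖g_t‖_* ≤ ρ for each t ∈ [T] and uses ∑_{t=1}^T 1/t ≤ 1 + ln T; the regularizer terms vanish since every increment is identically zero.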

Let us now compare both of these theorems. Note that they are very similar, with the main difference appearing in the indices of the norms in the second summation of each bound. Let us try to better understand the implications of these “off-by-one” differences. Let f_1, . . . , f_T and r_1, . . . , r_{T+1} be as in Theorem 4.4.3, and let t ∈ {1, . . . , T − 1}. Recall that, at round t, the player and the enemy choose, respectively, x_t and f_t simultaneously. Since x_t as defined by the AdaFTRL oracle is the first iterate which depends on r_t, it is at round t that the player has to choose the regularizer r_t.

Note that in both theorems the norm ‖·‖_(t) is related to the regularizers r_1, . . . , r_t. Since these regularizer increments are up to the player to choose, the player partially⁷ chooses the strong convexity parameters and the norms ‖·‖_(t). With that in mind, the player will probably want to choose a regularizer strategy which yields norms and parameters that give better guarantees on the regret. That is, at round t the player will try to come up with a regularizer increment r_t strongly convex w.r.t. a norm ‖·‖_(t) which makes the terms measured with its dual norm small.

Note, however, that in the general case (Theorem 4.4.3) the subgradients of f_t, a function the player does not know until round t + 1, are measured with the norm ‖·‖_(t), which the player has control over only up to round t. That is, the player has to pick a regularizer at round t aiming to control the norm of the subgradient of a function she will get to know only on the next round, i.e., round t + 1. In contrast, in Theorem 4.4.4 the subgradients of f_t are measured with the norm ‖·‖_{(t+1)}, which the player has some control over up to round t + 1, since ‖·‖_{(t+1)} in Theorem 4.4.4 depends on r_1, . . . , r_{t+1}, and r_{t+1} is chosen at round t + 1 by the player. Thus, in the case of a proximal regularizer strategy, the player can craft the norm ‖·‖_{(t+1)} with knowledge of the function f_t whose subgradient norms she wants to control in order to get good regret guarantees. The implication of this, as we are going to see later in applications, is that AdaFTRL algorithms with proximal regularizer strategies may need less prior information about the functions the enemy will play in order to obtain good regret bounds.
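As a small illustration of this discussion, the sketch below (ours; the quadratic increments and the linear losses are assumptions rather than the text's prescription) implements a proximal regularizer strategy in the spirit of Section 4.7: the increment chosen at round t + 1 is a quadratic centered at the current iterate x_t, so x_t minimizes it and the strategy is proximal.

```python
import numpy as np
from scipy.optimize import minimize

d, T = 2, 6
rng = np.random.default_rng(2)
gs = [rng.normal(size=d) for _ in range(T)]
fs = [lambda x, g=g: float(g @ x) for g in gs]  # linear losses f_t(x) = <g_t, x>

xs = [np.zeros(d)]                    # x_1 minimizes r_1 = (1/2)||.||_2^2 alone
rs = [lambda x: 0.5 * float(x @ x)]   # r_1, centered at the origin

for t in range(1, T):
    # Proximal increment r_{t+1}: a quadratic centered at the current iterate
    # x_t, so that x_t minimizes r_{t+1} (the defining property of "proximal").
    c = xs[-1]
    rs.append(lambda x, c=c: 0.5 * float(np.sum((x - c) ** 2)))
    past = fs[:t]
    obj = lambda x: sum(r(x) for r in rs) + sum(f(x) for f in past)
    xs.append(minimize(obj, xs[-1]).x)  # x_{t+1} minimizes H_t

print(np.round(np.array(xs), 3))
```

In this sketch each increment is 1-strongly convex w.r.t. ‖·‖₂, so H_{t−1} is t-strongly convex and the strategy is σ-proximally strong with σ_t = t.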

With these results, one can already expect the regret bounds of FTRL algorithms to depend heavily on the (dual) norms of the subgradients of the functions given by the enemy. Thus, for these bounds to be meaningful, we may need to assume a bound on the norms of the subgradients.

Although such an assumption may seem artificial at first glance, it turns out to be somewhat natural due to an interesting connection with Lipschitz continuity (see Theorem 3.8.4), the latter being a traditional hypothesis in convergence proofs of many optimization algorithms. One reason why Lipschitz continuity is a sensible assumption is the following: if a function can change drastically between two close points, then, intuitively, an algorithm will have a harder time optimizing it. Thus, if the functions played by the enemy are ρ-Lipschitz continuous, which is usually the case in the problems studied in the next sections, most of the subgradients of the functions have dual norm bounded by ρ.
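To see concretely how such a bound enters (again a standard specialization of ours, not from the text): if each f_t is ρ-Lipschitz w.r.t. ‖·‖₂, so that ‖g_t‖₂ ≤ ρ, and we scale the regularizer strategy of the earlier example to R(f) := [f = ⟨⟩](λ/2)‖·‖₂² with λ > 0, then R is λ-strong and Theorem 4.4.3 gives, for any comparator u with ‖u‖₂ ≤ B,

\[
\mathrm{Regret}(\mathrm{AdaFTRL}_R, f, u) \le \frac{\lambda}{2} \|u\|_2^2 + \frac{T \rho^2}{2 \lambda} \le \frac{\lambda B^2}{2} + \frac{T \rho^2}{2 \lambda},
\]

and the choice λ := (ρ/B)√T balances the two terms, yielding the familiar ρB√T regret bound.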

⁷We say “partially” here because the strong convexity parameter also depends on the functions played by the enemy. However, if the regularizer increments are strongly convex (which is usually the case), summing convex functions to these regularizers preserves the strong convexity property. Thus, one may ignore the “partially” in this sentence to build intuition.