
have that $\mu R$ is a $(\mu\sigma)$-strong classical FTRL regularizer. This will be useful to obtain optimal constants in the final regret bounds for classical FTRL. The only condition from the definition of classical FTRL regularizer which might not hold for a positive multiple of the regularizer is condition (4.4.iii): in general, it is not clear whether multiplying the regularizer by a positive constant affects the attainability of the infimum in some case of (4.4.iii). Fortunately, if $R$ is strongly convex, the infimum is still attained due to Lemma 3.9.14, which states that the infimum of a closed strongly convex function is always attained.

Lemma 4.5.2. Let $C := (X, \mathcal{F})$ be an OCO instance such that each $f \in \mathcal{F}$ is proper and closed, let $\sigma \in \mathbb{R}_{++}$, and let $R \colon \mathbb{E} \to (-\infty, +\infty]$ be a $\sigma$-strong classical FTRL regularizer for $C$. Then, for any $\mu \in \mathbb{R}_{++}$ we have that $\mu R$ is a $(\mu\sigma)$-strong classical FTRL regularizer for $C$.

Proof. Let $\mu \in \mathbb{R}_{++}$ and set $R' := \mu R$. Let us first show that $R'$ is a classical FTRL regularizer for $C$.

Since $R$ is closed, proper, and convex (property (4.4.i) of a classical FTRL regularizer), so is $R'$ since $\mu > 0$. Moreover, $\mathrm{dom}\, R' = \mathrm{dom}\, R \subseteq X$, that is, $R'$ satisfies property (4.4.ii).

Let $T \in \mathbb{N}$ and let $f \in \mathcal{F}^T$. Note that $F := R' + \sum_{t=1}^{T} f_t$ is closed since the sum of convex and closed functions is also closed by Theorem 3.2.7, and it is proper since $R + \sum_{t=1}^{T} f_t$ is proper and since $\mathrm{dom}\, F = \mathrm{dom}\bigl(R + \sum_{t=1}^{T} f_t\bigr)$. Finally, since $R$ is strongly convex and $\mu$ is positive, $F$ is strongly convex. Thus, by Lemma 3.9.14 we have that $\inf_{x \in \mathbb{E}} F(x)$ is attained, that is, $R'$ satisfies condition (4.4.iii) from the definition of classical FTRL regularizer for $C$.

Let us now show that $R'$ is $(\mu\sigma)$-strong. Since $\mu$ is positive and $R$ is $\sigma$-strongly convex, it is clear that $R'$ is $(\mu\sigma)$-strongly convex. Moreover, since $\mathrm{dom}\, R' = \mathrm{dom}\, R$, we have that $R'$ clearly satisfies condition (ii) from the definition of $(\mu\sigma)$-strongness for a classical FTRL regularizer of $C$.
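As a side remark, here is the scaling of strong convexity used in the proof above, written with one common definition of $\sigma$-strong convexity w.r.t. a norm $\|\cdot\|$ (the definition used earlier in the text may be stated in a slightly different form, but the same scaling argument applies): if for all $x, y \in \mathbb{E}$ and $\lambda \in [0,1]$ we have
\[
R(\lambda x + (1-\lambda)y) \le \lambda R(x) + (1-\lambda)R(y) - \frac{\sigma}{2}\lambda(1-\lambda)\|x - y\|^2,
\]
then multiplying through by $\mu > 0$ yields the same inequality for $\mu R$ with $\sigma$ replaced by $\mu\sigma$, so $\mu R$ is indeed $(\mu\sigma)$-strongly convex.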

Corollary 4.5.3 (Derived from Theorem 4.4.3). Let $C := (X, \mathcal{F})$ be an OCO instance such that each $f \in \mathcal{F}$ is proper and closed. Let $R \colon \mathbb{E} \to (-\infty, +\infty]$ be a $\sigma$-strong classical FTRL regularizer for $C$. Let $T \in \mathbb{N}$, let $\mathrm{ENEMY}$ be an enemy oracle for $C$, and define
\[
(\boldsymbol{x}, \boldsymbol{f}) := \mathrm{OCO}_C(\mathrm{FTRL}_R, \mathrm{ENEMY}, T).
\]
Finally, let $g_t \in \partial f_t(x_t)$ for each $t \in [T]$. Then $\boldsymbol{x} \in \mathrm{Seq}(X)$ and
\[
\mathrm{Regret}(\mathrm{FTRL}_R, \boldsymbol{f}, u) \le R(u) - \min_{x \in \mathbb{E}} R(x) + \frac{1}{2\sigma}\sum_{t=1}^{T}\|g_t\|_*^2 \qquad \forall u \in \mathbb{E}. \tag{4.17}
\]
In particular, if every function in $\mathcal{F}$ is $\rho$-Lipschitz continuous w.r.t. $\|\cdot\|$ on a convex set $D \supseteq X$ with nonempty interior\footnote{Nonempty interior is needed only for us to apply Theorem 3.8.4 to bound the dual norms of the subgradients.} and there is $\theta \in \mathbb{R}_{++}$ such that\footnote{One may think of this value as the diameter of the set $X$ measured through the lens of $R$.} $\theta \ge \sup\{R(x) - R(y) : x \in X,\ y \in X \cap \mathrm{dom}\, R\}$, then
\[
\mathrm{Regret}_T(\mathrm{FTRL}_{R'}, \mathrm{ENEMY}, X) \le \rho\sqrt{\frac{2\theta T}{\sigma}}, \qquad \text{where } R' := \frac{\rho\sqrt{T}}{\sqrt{2\sigma\theta}}\, R.
\]

Proof. Note that $\mathrm{FTRL}_R = \mathrm{AdaFTRL}_{\mathcal{R}}$, where $\mathcal{R}$ is given by $\mathcal{R}(\boldsymbol{f}) := [\boldsymbol{f} = \langle\rangle] R$ for every $\boldsymbol{f} \in \mathrm{Seq}\bigl((-\infty, +\infty]^{\mathbb{E}}\bigr)$. Moreover, since $R$ is a $\sigma$-strong FTRL regularizer, $\mathcal{R}$ is a $\boldsymbol{\sigma}$-strong regularizer strategy for $\boldsymbol{f}$ w.r.t. $\|\cdot\|$, where $\boldsymbol{\sigma} := \langle \sigma, \ldots, \sigma \rangle \in \mathbb{R}^T$. Therefore, the first inequality is a direct application of Theorem 4.4.3 together with the fact that $R(x_1) = \min_{x \in \mathbb{E}} R(x)$.
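To unpack the identity $\mathrm{FTRL}_R = \mathrm{AdaFTRL}_{\mathcal{R}}$ (this paragraph is only an informal gloss; the precise definition of AdaFTRL is the one from the previous section): roughly speaking, at each round AdaFTRL minimizes the sum of the regularizer increments issued so far plus the losses already revealed. Since $\mathcal{R}$ returns $R$ only when its argument is the empty sequence and the zero function afterwards, the accumulated regularizer is always just $R$, so the iterates of $\mathrm{AdaFTRL}_{\mathcal{R}}$ are
\[
x_{t+1} \in \operatorname*{arg\,min}_{x \in \mathbb{E}} \Bigl( R(x) + \sum_{s=1}^{t} f_s(x) \Bigr),
\]
which is exactly the classical FTRL update with regularizer $R$.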


If each $f \in \mathcal{F}$ is $\rho$-Lipschitz continuous w.r.t. $\|\cdot\|$ on a convex set $D \supseteq X$ with nonempty interior, then by Theorem 3.8.4 we have that, for each $f \in \mathcal{F}$ and $x \in X$, there is $g \in \partial f(x)$ such that $\|g\|_* \le \rho$. Using such subgradients with bounded dual norm in (4.17) and the fact that $\min_{x \in \mathbb{E}} R(x) = \min_{x \in X} R(x)$ yields

\[
\mathrm{Regret}_T(\mathrm{FTRL}_R, \mathrm{ENEMY}, u) \le R(u) - \min_{x \in X} R(x) + \frac{T\rho^2}{2\sigma}. \tag{4.18}
\]

Moreover, suppose there is $\theta \in \mathbb{R}_{++}$ such that $\theta \ge \sup\{R(x) - R(y) : x \in X,\ y \in X \cap \mathrm{dom}\, R\}$, and define
\[
R' := \frac{\rho\sqrt{T}}{\sqrt{2\sigma\theta}}\, R.
\]

Note that $R'$ is a $\bigl(\rho\sqrt{\sigma T}/\sqrt{2\theta}\bigr)$-strong classical FTRL regularizer by Lemma 4.5.2. Thus, plugging $R'$ into the above inequality yields, for every $u \in X$,

\[
\mathrm{Regret}_T(\mathrm{FTRL}_{R'}, \mathrm{ENEMY}, u)
\le \frac{\rho\sqrt{T}}{\sqrt{2\sigma\theta}}\Bigl(R(u) - \min_{x \in X} R(x)\Bigr) + \frac{\rho\sqrt{\theta T}}{\sqrt{2\sigma}}
\le \frac{\rho\sqrt{\theta T}}{\sqrt{2\sigma}} + \frac{\rho\sqrt{\theta T}}{\sqrt{2\sigma}}
= \rho\sqrt{\frac{2\theta T}{\sigma}},
\]
where in the second inequality we took the supremum over $u \in X$.
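It may be worth recording where the particular multiple $R' = \frac{\rho\sqrt{T}}{\sqrt{2\sigma\theta}} R$ comes from; the short computation below is not part of the original argument, but it shows that this choice simply balances the two terms of (4.18). For $\mu \in \mathbb{R}_{++}$, the regularizer $\mu R$ is $(\mu\sigma)$-strong by Lemma 4.5.2, so (4.18) applied to $\mu R$ gives, for every $u \in X$,
\[
\mathrm{Regret}_T(\mathrm{FTRL}_{\mu R}, \mathrm{ENEMY}, u)
\le \mu\bigl(R(u) - \min_{x \in X} R(x)\bigr) + \frac{T\rho^2}{2\mu\sigma}
\le \mu\theta + \frac{T\rho^2}{2\mu\sigma}.
\]
The right-hand side is minimized (e.g., by the AM--GM inequality) at $\mu^* = \rho\sqrt{T}/\sqrt{2\sigma\theta}$, where it equals $2\sqrt{\theta \cdot \tfrac{T\rho^2}{2\sigma}} = \rho\sqrt{2\theta T/\sigma}$, exactly the bound in the statement of the corollary.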

Let us look at the problem of prediction with expert advice. As we have seen in Chapter 2 (namely, in Proposition 2.6.2), in order to obtain a low-expected-regret randomized player oracle for the experts' problem instance $(A^E, Y, A, L)$, it suffices to devise a player oracle for the OCO instance $C := (\Delta_E, \mathcal{F})$, where the set $\mathcal{F}$ is given by $\mathcal{F} := \{\, p \in \mathbb{R}^E \mapsto y^{\mathrm{T}} p : y \in [-1,1]^E \,\}$. The next proposition shows that FTRL with $\ell_2$-regularization has low expected regret on any instance of the randomized experts' problem.

Proposition 4.5.4. Define the OCO instance $C := (\Delta_E, \mathcal{F})$, where $E$ is a finite set and $\mathcal{F} := \{\, p \in \mathbb{R}^E \mapsto y^{\mathrm{T}} p : y \in [-1,1]^E \,\}$. Set $d := |E|$ and define $R := \frac{1}{2}\|\cdot\|_2^2 + \delta(\cdot \mid \Delta_E)$. Moreover, let $T \in \mathbb{N}$ and define $R' := \sqrt{dT}\, R$. Then, for every enemy oracle $\mathrm{ENEMY}$ for $C$ we have
\[
\mathrm{Regret}_T(\mathrm{FTRL}_{R'}, \mathrm{ENEMY}, \Delta_E) \le \sqrt{dT}.
\]

Proof. First, let us show that

\[
\text{$R$ is a $1$-strong FTRL regularizer for $C$ w.r.t. $\|\cdot\|_2$.} \tag{4.19}
\]
We have that $\frac{1}{2}\|\cdot\|_2^2$ is closed (in fact, continuous) and that $\delta(\cdot \mid \Delta_E)$ is closed, the latter since $\Delta_E$ is closed. Thus, $R$ is a sum of closed functions and, hence, closed by Theorem 3.2.7, which means that $R$ satisfies condition (4.4.i) of a FTRL regularizer. Moreover, clearly $\mathrm{dom}\, R \subseteq \Delta_E$, that is, $R$ satisfies condition (4.4.ii). Let $T' \in \mathbb{N}$ and $f \in \mathcal{F}^{T'}$. Let us show that $\inf_{x \in \mathbb{R}^E}\bigl(R(x) + \sum_{t=1}^{T'} f_t(x)\bigr)$ is attained. First, notice that since the $\ell_2$-norm is induced by the euclidean inner product, by Lemma 3.9.5 we know that $\frac{1}{2}\|\cdot\|_2^2$ is $1$-strongly convex on $\mathbb{R}^E$ w.r.t. $\|\cdot\|_2$, which implies that so is $\frac{1}{2}\|\cdot\|_2^2 + \delta(\cdot \mid \Delta_E) = R$. Therefore, $R + \sum_{t=1}^{T'} f_t$ is also strongly convex. It is also proper and closed, the latter by Theorem 3.2.7 since it is the sum of closed functions. Thus, by Lemma 3.9.14 we have that the infimum of $R + \sum_{t=1}^{T'} f_t$ over $\mathbb{R}^E$ is attained. Therefore, $R$ satisfies condition (4.4.iii), and we conclude that $R$ is a classical FTRL regularizer. To see that $R$ is a $1$-strong FTRL regularizer, note first that $R$ is $1$-strongly convex w.r.t. $\|\cdot\|_2$, again by Lemma 3.9.5. Finally, since every function in $\mathcal{F}$ is linear, we have that $\mathrm{dom}\, f$ is the entire space for any $f \in \mathcal{F}$. Since $R$ is proper, this implies that $\mathrm{ri}\bigl(\mathrm{dom}(R + \sum_{t=1}^{T'-1} f_t)\bigr) \cap \mathrm{ri}(\mathrm{dom}\, f_{T'})$ is nonempty for every $f \in \mathcal{F}^{T'}$ and any $T' \in \mathbb{N}$. We conclude that $R$ is $1$-strong w.r.t. $\|\cdot\|_2$, which proves (4.19).

Let us now show that
\[
\text{every function in $\mathcal{F}$ is $\sqrt{d}$-Lipschitz continuous on $\mathbb{R}^E$ w.r.t. $\|\cdot\|_2$.} \tag{4.20}
\]
Let $y \in [-1,1]^E$ and define $f_y(x) := y^{\mathrm{T}} x$ for every $x \in \mathbb{R}^E$. By the definition of dual norm, for every $u, v \in \mathbb{R}^E$ and for every norm $\|\cdot\|$ on $\mathbb{R}^E$ we have
\[
|f_y(u) - f_y(v)| = |y^{\mathrm{T}}(u - v)| \le \|y\|_* \|u - v\|.
\]
Since the $\ell_2$-norm is self-dual and since $\|y\|_2 \le \sqrt{d}$ for every $y \in [-1,1]^E$, from the above inequality we conclude that every function in $\mathcal{F}$ is $\sqrt{d}$-Lipschitz continuous w.r.t. $\|\cdot\|_2$ on $\mathbb{R}^E$, which proves (4.20).
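As a quick sanity check (not needed for the proof), the constant $\sqrt{d}$ in (4.20) cannot be improved on $\mathbb{R}^E$: taking $y := \mathbb{1} \in [-1,1]^E$, $u := d^{-1/2}\mathbb{1}$, and $v := 0$ gives $\|u - v\|_2 = 1$ while $|f_y(u) - f_y(v)| = |\mathbb{1}^{\mathrm{T}} u| = \sqrt{d}$.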

Finally, let us show that
\[
\sup_{x, y \in \Delta_E} \bigl(R(x) - R(y)\bigr) \le \frac{1}{2}. \tag{4.21}
\]
Indeed, note that
\[
\sup_{x \in \Delta_E} R(x) = \frac{1}{2} \sup_{x \in \Delta_E} x^{\mathrm{T}} x \le \frac{1}{2} \sup_{x \in \Delta_E} \mathbb{1}^{\mathrm{T}} x = \frac{1}{2}.
\]

This together with the fact that $R(x) \ge 0$ for any $x \in \mathbb{R}^E$ proves (4.21). Since by definition we have $R' = \sqrt{dT}\, R = \frac{\rho\sqrt{T}}{\sqrt{2\theta}}\, R$ (with the strong convexity parameter $\sigma = 1$ from (4.19)), where $\rho := \sqrt{d}$ is the Lipschitz constant from (4.20) and $\theta := \frac{1}{2}$ is from (4.21), by Corollary 4.5.3 we have, for every enemy oracle $\mathrm{ENEMY}$ for $C$,
\[
\mathrm{Regret}_T(\mathrm{FTRL}_{R'}, \mathrm{ENEMY}, \Delta_E) \le \sqrt{dT}. \tag{4.22}
\]
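As an aside (the computation below is a standard completing-the-square argument rather than something taken from the proof), the iterates of $\mathrm{FTRL}_{R'}$ in this instance have a simple geometric description: writing $G_t := \sum_{s=1}^{t} y_s$, where $y_s \in [-1,1]^E$ is the vector defining $f_s$, the point played at round $t+1$ is
\[
x_{t+1} \in \operatorname*{arg\,min}_{p \in \Delta_E}\Bigl( G_t^{\mathrm{T}} p + \frac{\sqrt{dT}}{2}\|p\|_2^2 \Bigr)
= \operatorname*{arg\,min}_{p \in \Delta_E} \Bigl\| p + \frac{G_t}{\sqrt{dT}} \Bigr\|_2^2,
\]
that is, the euclidean projection of $-G_t/\sqrt{dT}$ onto the simplex.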

It is natural to ask if this regret bound is optimal, especially since our choice of regularizer was mainly due to the self-duality of the $\ell_2$-norm, for the sake of simplicity. It turns out that the dependence on $T$ of the bound given by Corollary 4.5.3 is optimal: there is a class of OCO instances of the type $(X, \mathcal{F}')$, with each function in $\mathcal{F}'$ being Lipschitz continuous, such that the worst-case regret in $T$ rounds of any player oracle on such instances is no better than $\Omega(\sqrt{T})$, where the constants hidden by the asymptotic notation may depend on other parameters of the instance, such as the dimension [2]. The fact that such an intuitive algorithm already attains optimal regret asymptotically (w.r.t. $T$) is surprising. Still, this lower bound says nothing about the dependence on the dimension, which can be high, even more so in machine learning applications.

However, a smarter choice of regularizer already exponentially improves the dependence of the regret bound for FTRL on the number of experts, which can be seen as the dimension of the problem.

Proposition 4.5.5. Define the OCO instance $C := (\Delta_E, \mathcal{F})$, where $E$ is a finite set and $\mathcal{F} := \{\, p \in \mathbb{R}^E \mapsto y^{\mathrm{T}} p : y \in [-1,1]^E \,\}$. Set $d := |E|$, define
\[
R(x) := \sum_{i \in E} [x_i \neq 0]\, x_i \ln x_i + \delta(x \mid \Delta_E) \qquad \text{for every } x \in \mathbb{R}^E,
\]
let $T \in \mathbb{N}$, and set $R' := \sqrt{T/(2\ln d)}\, R$. Then, for every enemy oracle $\mathrm{ENEMY}$ for $C$ we have
\[
\mathrm{Regret}_T(\mathrm{FTRL}_{R'}, \mathrm{ENEMY}, \Delta_E) \le \sqrt{2(\ln d)T}.
\]

Proof. First, let us show that
\[
\text{$R$ is a $1$-strong FTRL regularizer for $C$ w.r.t. $\|\cdot\|_1$.} \tag{4.23}
\]
By Lemma 3.9.10 and since $\Delta_E$ is closed, we know that $R$ is proper, closed, and convex, that is, it satisfies condition (4.4.i) from the definition of FTRL regularizer. Moreover, we clearly have $\mathrm{dom}\, R \subseteq \Delta_E$, which means that $R$ satisfies condition (4.4.ii). To show that (4.4.iii) holds, let $T' \in \mathbb{N}$ and let $f \in \mathcal{F}^{T'}$. Since each function in $\mathcal{F}$ is closed, we have that $F := R + \sum_{t=1}^{T'} f_t$ is the sum of closed functions and, thus, closed by Theorem 3.2.7. Moreover, by Lemma 3.9.10 we know that $R$, and thus $F$, are strongly convex. Finally, by Lemma 3.9.14 we have that $\inf_{x \in \mathbb{R}^E} F(x)$ is attained, which proves that $R$ is a classical FTRL regularizer for $C$. Let us prove that it is a $1$-strong regularizer for $C$ w.r.t. the $\ell_1$-norm. Indeed, by Lemma 3.9.10 once more we know that $R$ is $1$-strongly convex w.r.t. $\|\cdot\|_1$. Additionally, by the definition of $\mathcal{F}$ we have $\mathrm{dom}\, f_t = \mathbb{R}^E$ for any $t \in [T']$. Since $R$ is proper, this implies that $\mathrm{ri}\bigl(\mathrm{dom}(R + \sum_{i=1}^{t-1} f_i)\bigr) \cap \mathrm{ri}(\mathrm{dom}\, f_t)$ is nonempty for every $t \in [T']$. This completes the proof of (4.23).
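It is perhaps worth pausing to note what FTRL with a scaled entropic regularizer actually computes; the closed form below follows from a standard Lagrange-multiplier computation and is not part of the proof. Writing $c := \sqrt{T/(2\ln d)}$ and $G_t := \sum_{s=1}^{t} y_s$, where $y_s \in [-1,1]^E$ is the vector defining $f_s$, the iterate
\[
x_{t+1} \in \operatorname*{arg\,min}_{p \in \Delta_E}\Bigl( G_t^{\mathrm{T}} p + c \sum_{i \in E} p_i \ln p_i \Bigr)
\quad\text{is given by}\quad
x_{t+1,i} = \frac{\exp(-G_{t,i}/c)}{\sum_{j \in E} \exp(-G_{t,j}/c)} \quad \text{for each } i \in E,
\]
that is, FTRL with the entropic regularizer on the simplex is exactly the exponentially weighted average (often called Hedge) strategy with step size $1/c$.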

Note now that, since the dual norm of $\|\cdot\|_1$ is $\|\cdot\|_\infty$, by H\"older's inequality we have, for every $y \in [-1,1]^E$ and every $u, v \in \mathbb{R}^E$,
\[
|y^{\mathrm{T}} u - y^{\mathrm{T}} v| = |y^{\mathrm{T}}(u - v)| \le \|y\|_\infty \|u - v\|_1 \le \|u - v\|_1.
\]
Thus, we conclude that every function in $\mathcal{F}$ is $1$-Lipschitz continuous w.r.t. $\|\cdot\|_1$ on $\mathbb{R}^E$. To conclude, let us show that

\[
\sup_{x, y \in \Delta_E}\bigl(R(x) - R(y)\bigr) \le \ln d. \tag{4.24}
\]

First, since $[\alpha > 0]\, \alpha \ln \alpha \le 0$ for every $\alpha \in [0,1]$, we have $\sup_{x \in \Delta_E} R(x) \le 0$. Thus, we need only show that $\inf_{y \in \Delta_E} R(y)$ is attained by $d^{-1}\mathbb{1}$. Indeed, note that for every $x \in \Delta_E$ we have
\[
-\nabla R(d^{-1}\mathbb{1})^{\mathrm{T}}(x - d^{-1}\mathbb{1})
= -\Bigl(\mathbb{1} + \sum_{i \in E} e_i \ln d^{-1}\Bigr)^{\mathrm{T}}(x - d^{-1}\mathbb{1})
= -\bigl(\mathbb{1} - (\ln d)\mathbb{1}\bigr)^{\mathrm{T}}(x - d^{-1}\mathbb{1})
= -(1 - \ln d)(1 - 1) = 0.
\]
That is, $-\nabla R(d^{-1}\mathbb{1}) \in N_{\Delta_E}(d^{-1}\mathbb{1})$. By the optimality conditions from Theorem 3.6.2 we conclude that $\inf_{y \in \Delta_E} R(y) = R(d^{-1}\mathbb{1}) = -\ln d$, which proves (4.24). Since
\[
R' = \sqrt{\frac{T}{2\ln d}}\, R = \frac{\rho\sqrt{T}}{\sqrt{2\theta}}\, R,
\]
where $\rho := 1$ and $\theta := \ln d$, by Corollary 4.5.3 we have, for every enemy oracle $\mathrm{ENEMY}$ for $C$,
\[
\mathrm{Regret}_T(\mathrm{FTRL}_{R'}, \mathrm{ENEMY}, \Delta_E) \le \sqrt{2(\ln d)T}.
\]
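The guarantee of Proposition 4.5.5 is easy to probe numerically. The sketch below is not from the text: it simulates the experts' problem with uniformly random loss vectors in $[-1,1]^E$ (the random enemy, the function name, and the parameter values are our own choices), runs $\mathrm{FTRL}_{R'}$ using the closed form for the entropic iterate noted after (4.23), and compares the realized regret with the bound $\sqrt{2(\ln d)T}$. Since the bound holds for every enemy oracle, the first printed value should not exceed the second.
\begin{verbatim}
import numpy as np

def entropic_ftrl_regret(Y, c):
    """Play FTRL with regularizer c * sum_i p_i ln p_i (plus the indicator of
    the simplex) against the loss vectors Y[0], ..., Y[T-1].  The round-t
    iterate has the closed form p proportional to exp(-G/c), where G is the
    sum of the loss vectors revealed before round t."""
    T, d = Y.shape
    G = np.zeros(d)                  # accumulated loss vector
    player_loss = 0.0
    for t in range(T):
        z = -G / c
        z -= z.max()                 # stabilize the exponentials
        p = np.exp(z)
        p /= p.sum()                 # iterate x_t, a point of the simplex
        player_loss += float(Y[t] @ p)
        G += Y[t]
    return player_loss - G.min()     # regret against the best expert

rng = np.random.default_rng(0)
d, T = 50, 10_000
Y = rng.uniform(-1.0, 1.0, size=(T, d))   # enemy: i.i.d. uniform losses
c = np.sqrt(T / (2.0 * np.log(d)))        # scaling from Proposition 4.5.5
print(entropic_ftrl_regret(Y, c), np.sqrt(2.0 * np.log(d) * T))
\end{verbatim}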

It is interesting to try to understand the intuition behind the difference between the regret bounds given by the entropic regularizer (which is strongly convex w.r.t. the $\ell_1$-norm) and the squared $\ell_2$ regularizer on the prediction with expert advice problem with $d$ experts. Notice that in the case of $\ell_2$ regularization, even though the ``diameter'' of the set where the player is making her choices (i.e., the simplex in this case) is less than $1/2$, the functions the enemy can pick behave badly under the lens of the $\ell_2$-norm: the functions played by the enemy on the experts' problem are $\sqrt{d}$-Lipschitz continuous w.r.t. $\|\cdot\|_2$. In the case of the entropic regularizer, the functions behave way better w.r.t. the $\ell_1$-norm: they are $1$-Lipschitz continuous on the simplex w.r.t. the $\ell_1$-norm. However, this improvement on the Lipschitz constant is not for free: the diameter of the simplex through the lens of the entropic regularizer is $\ln d$, no longer a constant, though still small when compared to $d$. Still, in this case the trade-off is quite advantageous. Thus, when looking for FTRL regularizers $R$ for an OCO instance $C := (X, \mathcal{F})$, the intuition is that one should balance two factors.

The first is the diameter of $X$ through the lens of $R$, that is, any two points inside $X$ should not have values of $R$ which are too far apart. At the same time, the regularizer is usually associated with a norm $\|\cdot\|$ with respect to which $R$ is strongly convex. In this case, one wants the functions played by the enemy to be ``well-behaved'' under $\|\cdot\|$, that is, to have a small Lipschitz constant w.r.t. $\|\cdot\|$. To study the Lipschitz constant of the functions from $\mathcal{F}$, it is usually useful to look at the dual norms of the subgradients since, for any $x, y \in \mathbb{E}$ and any convex function $f \colon \mathbb{E} \to (-\infty, +\infty]$ which is subdifferentiable at $x$, the subgradient inequality yields, for any $g \in \partial f(x)$,
\[
f(x) - f(y) \le \langle g, x - y \rangle \le \|g\|_* \|x - y\|.
\]
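To make this trade-off concrete with a number that is not in the text: for $d = 1024$ experts, the $\ell_2$ route of Proposition 4.5.4 pays $\rho\sqrt{2\theta} = \sqrt{d}\cdot\sqrt{2\cdot\tfrac12} = 32$ per $\sqrt{T}$, while the entropic route of Proposition 4.5.5 pays $\rho\sqrt{2\theta} = 1\cdot\sqrt{2\ln 1024} \approx 3.7$ per $\sqrt{T}$ (with $\sigma = 1$ in both cases), which is the exponential improvement in the dependence on $d$ mentioned earlier.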