
have that $\mu R$ is a $(\mu\sigma)$-strong classical FTRL regularizer. This will be useful to obtain optimal constants in the final regret bounds for classical FTRL. The only condition from the definition of classical FTRL regularizer which might not hold for a positive multiple of the regularizer is condition (4.4.iii): in general, it is not clear whether multiplying the regularizer by a positive constant affects the attainability of the infimum in some case of (4.4.iii). Fortunately, if $R$ is strongly convex, the infimum is still attained due to Lemma 3.9.14, which states that the infimum of a closed strongly convex function is always attained.

Lemma 4.5.2. Let $C := (X, \mathcal{F})$ be an OCO instance such that each $f \in \mathcal{F}$ is proper and closed, let $\sigma \in \mathbb{R}_{++}$, and let $R \colon \mathbb{E} \to (-\infty, +\infty]$ be a $\sigma$-strong classical FTRL regularizer for $C$. Then, for any $\mu \in \mathbb{R}_{++}$ we have that $\mu R$ is a $(\mu\sigma)$-strong classical FTRL regularizer for $C$.

Proof. Let $\mu \in \mathbb{R}_{++}$ and set $R' := \mu R$. Let us first show that $R'$ is a classical FTRL regularizer for $C$.

Since $R$ is closed, proper, and convex (property (4.4.i) of a classical FTRL regularizer), so is $R'$ since $\mu > 0$. Moreover, $\mathrm{dom}\, R' = \mathrm{dom}\, R \subseteq X$, that is, $R'$ satisfies property (4.4.ii).

Let $T \in \mathbb{N}$ and let $f \in \mathcal{F}^T$. Note that $F := R' + \sum_{t=1}^{T} f_t$ is closed since the sum of convex and closed functions is also closed by Theorem 3.2.7, and it is proper since $R + \sum_{t=1}^{T} f_t$ is proper and since $\mathrm{dom}\, F = \mathrm{dom}\bigl(R + \sum_{t=1}^{T} f_t\bigr)$. Finally, since $R$ is strongly convex and $\mu$ is positive, $F$ is strongly convex. Thus, by Lemma 3.9.14 we have that $\inf_{x \in \mathbb{E}} F(x)$ is attained, that is, $R'$ satisfies condition (4.4.iii) from the definition of classical FTRL regularizer for $C$.

Let us now show that $R'$ is $(\mu\sigma)$-strong. Since $\mu$ is positive and $R$ is $\sigma$-strongly convex, it is clear that $R'$ is $(\mu\sigma)$-strongly convex. Moreover, since $\mathrm{dom}\, R' = \mathrm{dom}\, R$, we have that $R'$ clearly satisfies condition (ii) from the definition of $(\mu\sigma)$-strongness for a classical FTRL regularizer of $C$.
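As a side remark, here is the scaling of strong convexity used in the proof above, written with one common definition of $\sigma$-strong convexity w.r.t. a norm $\|\cdot\|$ (the definition used earlier in the text may be stated in a slightly different form, but the same scaling argument applies): if for all $x, y \in \mathbb{E}$ and $\lambda \in [0,1]$ we have
\[
R(\lambda x + (1-\lambda)y) \le \lambda R(x) + (1-\lambda)R(y) - \frac{\sigma}{2}\lambda(1-\lambda)\|x - y\|^2,
\]
then multiplying through by $\mu > 0$ yields the same inequality for $\mu R$ with $\sigma$ replaced by $\mu\sigma$, so $\mu R$ is indeed $(\mu\sigma)$-strongly convex.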

Corollary 4.5.3 (Derived from Theorem 4.4.3). Let $C := (X, \mathcal{F})$ be an OCO instance such that each $f \in \mathcal{F}$ is proper and closed. Let $R \colon \mathbb{E} \to (-\infty, +\infty]$ be a $\sigma$-strong classical FTRL regularizer for $C$. Let $T \in \mathbb{N}$, let $\mathrm{ENEMY}$ be an enemy oracle for $C$, and define
\[
(\boldsymbol{x}, \boldsymbol{f}) := \mathrm{OCO}_C(\mathrm{FTRL}_R, \mathrm{ENEMY}, T).
\]
Finally, let $g_t \in \partial f_t(x_t)$ for each $t \in [T]$. Then $\boldsymbol{x} \in \mathrm{Seq}(X)$ and
\[
\mathrm{Regret}(\mathrm{FTRL}_R, \boldsymbol{f}, u) \le R(u) - \min_{x \in \mathbb{E}} R(x) + \frac{1}{2\sigma}\sum_{t=1}^{T}\|g_t\|_*^2 \qquad \forall u \in \mathbb{E}. \tag{4.17}
\]
In particular, if every function in $\mathcal{F}$ is $\rho$-Lipschitz continuous w.r.t. $\|\cdot\|$ on a convex set $D \supseteq X$ with nonempty interior\footnote{Nonempty interior is needed only for us to apply Theorem 3.8.4 to bound the dual norms of the subgradients.} and there is $\theta \in \mathbb{R}_{++}$ such that\footnote{One may think of this value as the diameter of the set $X$ measured through the lens of $R$.} $\theta \ge \sup\{R(x) - R(y) : x \in X,\ y \in X \cap \mathrm{dom}\, R\}$, then
\[
\mathrm{Regret}_T(\mathrm{FTRL}_{R'}, \mathrm{ENEMY}, X) \le \rho\sqrt{\frac{2\theta T}{\sigma}}, \qquad \text{where } R' := \frac{\rho\sqrt{T}}{\sqrt{2\sigma\theta}}\, R.
\]

Proof. Note that $\mathrm{FTRL}_R = \mathrm{AdaFTRL}_{\mathcal{R}}$, where $\mathcal{R}$ is given by $\mathcal{R}(\boldsymbol{f}) := [\boldsymbol{f} = \langle\rangle] R$ for every $\boldsymbol{f} \in \mathrm{Seq}\bigl((-\infty, +\infty]^{\mathbb{E}}\bigr)$. Moreover, since $R$ is a $\sigma$-strong FTRL regularizer, $\mathcal{R}$ is a $\boldsymbol{\sigma}$-strong regularizer strategy for $\boldsymbol{f}$ w.r.t. $\|\cdot\|$, where $\boldsymbol{\sigma} := \langle \sigma, \ldots, \sigma \rangle \in \mathbb{R}^T$. Therefore, the first inequality is a direct application of Theorem 4.4.3 together with the fact that $R(x_1) = \min_{x \in \mathbb{E}} R(x)$.
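To unpack the identity $\mathrm{FTRL}_R = \mathrm{AdaFTRL}_{\mathcal{R}}$ (this paragraph is only an informal gloss; the precise definition of AdaFTRL is the one from the previous section): roughly speaking, at each round AdaFTRL minimizes the sum of the regularizer increments issued so far plus the losses already revealed. Since $\mathcal{R}$ returns $R$ only when its argument is the empty sequence and the zero function afterwards, the accumulated regularizer is always just $R$, so the iterates of $\mathrm{AdaFTRL}_{\mathcal{R}}$ are
\[
x_{t+1} \in \operatorname*{arg\,min}_{x \in \mathbb{E}} \Bigl( R(x) + \sum_{s=1}^{t} f_s(x) \Bigr),
\]
which is exactly the classical FTRL update with regularizer $R$.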


If each $f \in \mathcal{F}$ is $\rho$-Lipschitz continuous w.r.t. $\|\cdot\|$ on a convex set $D \supseteq X$ with nonempty interior, then by Theorem 3.8.4 we have that, for each $f \in \mathcal{F}$ and $x \in X$, there is $g \in \partial f(x)$ such that $\|g\|_* \le \rho$. Using such subgradients with bounded dual norm in (4.17) and the fact that $\min_{x \in \mathbb{E}} R(x) = \min_{x \in X} R(x)$ yields

\[
\mathrm{Regret}_T(\mathrm{FTRL}_R, \mathrm{ENEMY}, u) \le R(u) - \min_{x \in X} R(x) + \frac{T\rho^2}{2\sigma}. \tag{4.18}
\]

Moreover, suppose there is $\theta \in \mathbb{R}_{++}$ such that $\theta \ge \sup\{R(x) - R(y) : x \in X,\ y \in X \cap \mathrm{dom}\, R\}$, and define
\[
R' := \frac{\rho\sqrt{T}}{\sqrt{2\sigma\theta}}\, R.
\]

Note that $R'$ is a $\bigl(\rho\sqrt{\sigma T}/\sqrt{2\theta}\bigr)$-strong classical FTRL regularizer by Lemma 4.5.2. Thus, plugging $R'$ into the above inequality yields, for every $u \in X$,

\[
\mathrm{Regret}_T(\mathrm{FTRL}_{R'}, \mathrm{ENEMY}, u)
\le \frac{\rho\sqrt{T}}{\sqrt{2\sigma\theta}}\Bigl(R(u) - \min_{x \in X} R(x)\Bigr) + \frac{\rho\sqrt{\theta T}}{\sqrt{2\sigma}}
\le \frac{\rho\sqrt{\theta T}}{\sqrt{2\sigma}} + \frac{\rho\sqrt{\theta T}}{\sqrt{2\sigma}}
= \rho\sqrt{\frac{2\theta T}{\sigma}},
\]
where in the second inequality we took the supremum over $u \in X$.
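It may be worth recording where the particular multiple $R' = \frac{\rho\sqrt{T}}{\sqrt{2\sigma\theta}} R$ comes from; the short computation below is not part of the original argument, but it shows that this choice simply balances the two terms of (4.18). For $\mu \in \mathbb{R}_{++}$, the regularizer $\mu R$ is $(\mu\sigma)$-strong by Lemma 4.5.2, so (4.18) applied to $\mu R$ gives, for every $u \in X$,
\[
\mathrm{Regret}_T(\mathrm{FTRL}_{\mu R}, \mathrm{ENEMY}, u)
\le \mu\bigl(R(u) - \min_{x \in X} R(x)\bigr) + \frac{T\rho^2}{2\mu\sigma}
\le \mu\theta + \frac{T\rho^2}{2\mu\sigma}.
\]
The right-hand side is minimized (e.g., by the AM--GM inequality) at $\mu^* = \rho\sqrt{T}/\sqrt{2\sigma\theta}$, where it equals $2\sqrt{\theta \cdot \tfrac{T\rho^2}{2\sigma}} = \rho\sqrt{2\theta T/\sigma}$, exactly the bound in the statement of the corollary.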

Let us look at the problem of prediction with expert advice. As we have seen in Chapter 2 (namely, in Proposition 2.6.2), in order to obtain a low-expected-regret randomized player oracle for the experts' problem instance $(A^E, Y, A, L)$, it suffices to devise a player oracle for the OCO instance $C := (\Delta_E, \mathcal{F})$, where the set $\mathcal{F}$ is given by $\mathcal{F} := \{\, p \in \mathbb{R}^E \mapsto y^{\mathrm{T}} p : y \in [-1,1]^E \,\}$. The next proposition shows that FTRL with $\ell_2$-regularization has low expected regret on any instance of the randomized experts' problem.

Proposition 4.5.4. Define the OCO instance $C := (\Delta_E, \mathcal{F})$, where $E$ is a finite set and $\mathcal{F} := \{\, p \in \mathbb{R}^E \mapsto y^{\mathrm{T}} p : y \in [-1,1]^E \,\}$. Set $d := |E|$ and define $R := \frac{1}{2}\|\cdot\|_2^2 + \delta(\cdot \mid \Delta_E)$. Moreover, let $T \in \mathbb{N}$ and define $R' := \sqrt{dT}\, R$. Then, for every enemy oracle $\mathrm{ENEMY}$ for $C$ we have
\[
\mathrm{Regret}_T(\mathrm{FTRL}_{R'}, \mathrm{ENEMY}, \Delta_E) \le \sqrt{dT}.
\]

Proof. First, let us show that

\[
\text{$R$ is a $1$-strong FTRL regularizer for $C$ w.r.t. $\|\cdot\|_2$.} \tag{4.19}
\]
We have that $\frac{1}{2}\|\cdot\|_2^2$ is closed (in fact, continuous) and that $\delta(\cdot \mid \Delta_E)$ is closed, the latter since $\Delta_E$ is closed. Thus, $R$ is a sum of closed functions and, hence, closed by Theorem 3.2.7, which means that $R$ satisfies condition (4.4.i) of a FTRL regularizer. Moreover, clearly $\mathrm{dom}\, R \subseteq \Delta_E$, that is, $R$ satisfies condition (4.4.ii). Let $T' \in \mathbb{N}$ and $f \in \mathcal{F}^{T'}$. Let us show that $\inf_{x \in \mathbb{R}^E}\bigl(R(x) + \sum_{t=1}^{T'} f_t(x)\bigr)$ is attained. First, notice that since the $\ell_2$-norm is induced by the euclidean inner product, by Lemma 3.9.5 we know that $\frac{1}{2}\|\cdot\|_2^2$ is $1$-strongly convex on $\mathbb{R}^E$ w.r.t. $\|\cdot\|_2$, which implies that so is $\frac{1}{2}\|\cdot\|_2^2 + \delta(\cdot \mid \Delta_E) = R$. Therefore, $R + \sum_{t=1}^{T'} f_t$ is also strongly convex. It is also proper and closed, the latter by Theorem 3.2.7 since it is the sum of closed functions. Thus, by Lemma 3.9.14 we have that the infimum of $R + \sum_{t=1}^{T'} f_t$ over $\mathbb{R}^E$ is attained. Therefore, $R$ satisfies condition (4.4.iii), and we conclude that $R$ is a classical FTRL regularizer. To see that $R$ is a $1$-strong FTRL regularizer, note first that $R$ is $1$-strongly convex w.r.t. $\|\cdot\|_2$, again by Lemma 3.9.5. Finally, since every function in $\mathcal{F}$ is linear, we have that $\mathrm{dom}\, f$ is the entire space for any $f \in \mathcal{F}$. Since $R$ is proper, this implies that $\mathrm{ri}\bigl(\mathrm{dom}(R + \sum_{t=1}^{T'-1} f_t)\bigr) \cap \mathrm{ri}(\mathrm{dom}\, f_{T'})$ is nonempty for every $f \in \mathcal{F}^{T'}$ and any $T' \in \mathbb{N}$. We conclude that $R$ is $1$-strong w.r.t. $\|\cdot\|_2$, which proves (4.19).

Let us now show that
\[
\text{every function in $\mathcal{F}$ is $\sqrt{d}$-Lipschitz continuous on $\mathbb{R}^E$ w.r.t. $\|\cdot\|_2$.} \tag{4.20}
\]
Let $y \in [-1,1]^E$ and define $f_y(x) := y^{\mathrm{T}} x$ for every $x \in \mathbb{R}^E$. By the definition of dual norm, for every $u, v \in \mathbb{R}^E$ and for every norm $\|\cdot\|$ on $\mathbb{R}^E$ we have
\[
|f_y(u) - f_y(v)| = |y^{\mathrm{T}}(u - v)| \le \|y\|_* \|u - v\|.
\]
Since the $\ell_2$-norm is self-dual and since $\|y\|_2 \le \sqrt{d}$ for every $y \in [-1,1]^E$, from the above inequality we conclude that every function in $\mathcal{F}$ is $\sqrt{d}$-Lipschitz continuous w.r.t. $\|\cdot\|_2$ on $\mathbb{R}^E$, which proves (4.20).
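As a quick sanity check (not needed for the proof), the constant $\sqrt{d}$ in (4.20) cannot be improved on $\mathbb{R}^E$: taking $y := \mathbb{1} \in [-1,1]^E$, $u := d^{-1/2}\mathbb{1}$, and $v := 0$ gives $\|u - v\|_2 = 1$ while $|f_y(u) - f_y(v)| = |\mathbb{1}^{\mathrm{T}} u| = \sqrt{d}$.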

Finally, let us show that
\[
\sup_{x, y \in \Delta_E} \bigl(R(x) - R(y)\bigr) \le \frac{1}{2}. \tag{4.21}
\]
Indeed, note that
\[
\sup_{x \in \Delta_E} R(x) = \frac{1}{2} \sup_{x \in \Delta_E} x^{\mathrm{T}} x \le \frac{1}{2} \sup_{x \in \Delta_E} \mathbb{1}^{\mathrm{T}} x = \frac{1}{2}.
\]

This together with the fact that $R(x) \ge 0$ for any $x \in \mathbb{R}^E$ proves (4.21). Since by definition we have $R' = \sqrt{dT}\, R = \frac{\rho\sqrt{T}}{\sqrt{2\theta}}\, R$ (with the strong convexity parameter $\sigma = 1$ from (4.19)), where $\rho := \sqrt{d}$ is the Lipschitz constant from (4.20) and $\theta := \frac{1}{2}$ is from (4.21), by Corollary 4.5.3 we have, for every enemy oracle $\mathrm{ENEMY}$ for $C$,
\[
\mathrm{Regret}_T(\mathrm{FTRL}_{R'}, \mathrm{ENEMY}, \Delta_E) \le \sqrt{dT}. \tag{4.22}
\]
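As an aside (the computation below is a standard completing-the-square argument rather than something taken from the proof), the iterates of $\mathrm{FTRL}_{R'}$ in this instance have a simple geometric description: writing $G_t := \sum_{s=1}^{t} y_s$, where $y_s \in [-1,1]^E$ is the vector defining $f_s$, the point played at round $t+1$ is
\[
x_{t+1} \in \operatorname*{arg\,min}_{p \in \Delta_E}\Bigl( G_t^{\mathrm{T}} p + \frac{\sqrt{dT}}{2}\|p\|_2^2 \Bigr)
= \operatorname*{arg\,min}_{p \in \Delta_E} \Bigl\| p + \frac{G_t}{\sqrt{dT}} \Bigr\|_2^2,
\]
that is, the euclidean projection of $-G_t/\sqrt{dT}$ onto the simplex.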

It is natural to ask if this regret bound is optimal, especially since our choice of regularizer was mainly due to the self-duality of the $\ell_2$-norm, for the sake of simplicity. It turns out that the dependence on $T$ of the bound given by Corollary 4.5.3 is optimal: there is a class of OCO instances of the type $(X, \mathcal{F}')$, with each function in $\mathcal{F}'$ being Lipschitz continuous, such that the worst-case regret in $T$ rounds of any player oracle on such instances is no better than $\Omega(\sqrt{T})$, where the constants hidden by the asymptotic notation may depend on other parameters of the instance, such as the dimension [2]. The fact that such an intuitive algorithm already attains optimal regret asymptotically (w.r.t. $T$) is surprising. Still, this lower bound says nothing about the dependence on the dimension, which can be high, even more so in machine learning applications.

However, a smarter choice of regularizer already exponentially improves the dependence of the regret bound for FTRL on the number of experts, which can be seen as the dimension of the problem.

Proposition 4.5.5. Define the OCO instance $C := (\Delta_E, \mathcal{F})$, where $E$ is a finite set and $\mathcal{F} := \{\, p \in \mathbb{R}^E \mapsto y^{\mathrm{T}} p : y \in [-1,1]^E \,\}$. Set $d := |E|$, define
\[
R(x) := \sum_{i \in E} [x_i \neq 0]\, x_i \ln x_i + \delta(x \mid \Delta_E) \qquad \text{for every } x \in \mathbb{R}^E,
\]
let $T \in \mathbb{N}$, and set $R' := \sqrt{T/(2\ln d)}\, R$. Then, for every enemy oracle $\mathrm{ENEMY}$ for $C$ we have
\[
\mathrm{Regret}_T(\mathrm{FTRL}_{R'}, \mathrm{ENEMY}, \Delta_E) \le \sqrt{2(\ln d)T}.
\]

Proof. First, let us show that
\[
\text{$R$ is a $1$-strong FTRL regularizer for $C$ w.r.t. $\|\cdot\|_1$.} \tag{4.23}
\]
By Lemma 3.9.10 and since $\Delta_E$ is closed, we know that $R$ is proper, closed, and convex, that is, it satisfies condition (4.4.i) from the definition of FTRL regularizer. Moreover, we clearly have $\mathrm{dom}\, R \subseteq \Delta_E$, which means that $R$ satisfies condition (4.4.ii). To show that (4.4.iii) holds, let $T' \in \mathbb{N}$ and let $f \in \mathcal{F}^{T'}$. Since each function in $\mathcal{F}$ is closed, we have that $F := R + \sum_{t=1}^{T'} f_t$ is the sum of closed functions and, thus, closed by Theorem 3.2.7. Moreover, by Lemma 3.9.10 we know that $R$, and thus $F$, are strongly convex. Finally, by Lemma 3.9.14 we have that $\inf_{x \in \mathbb{R}^E} F(x)$ is attained, which proves that $R$ is a classical FTRL regularizer for $C$. Let us prove that it is a $1$-strong regularizer for $C$ w.r.t. the $\ell_1$-norm. Indeed, by Lemma 3.9.10 once more we know that $R$ is $1$-strongly convex w.r.t. $\|\cdot\|_1$. Additionally, by the definition of $\mathcal{F}$ we have $\mathrm{dom}\, f_t = \mathbb{R}^E$ for any $t \in [T']$. Since $R$ is proper, this implies that $\mathrm{ri}\bigl(\mathrm{dom}(R + \sum_{i=1}^{t-1} f_i)\bigr) \cap \mathrm{ri}(\mathrm{dom}\, f_t)$ is nonempty for every $t \in [T']$. This completes the proof of (4.23).
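It is perhaps worth pausing to note what FTRL with a scaled entropic regularizer actually computes; the closed form below follows from a standard Lagrange-multiplier computation and is not part of the proof. Writing $c := \sqrt{T/(2\ln d)}$ and $G_t := \sum_{s=1}^{t} y_s$, where $y_s \in [-1,1]^E$ is the vector defining $f_s$, the iterate
\[
x_{t+1} \in \operatorname*{arg\,min}_{p \in \Delta_E}\Bigl( G_t^{\mathrm{T}} p + c \sum_{i \in E} p_i \ln p_i \Bigr)
\quad\text{is given by}\quad
x_{t+1,i} = \frac{\exp(-G_{t,i}/c)}{\sum_{j \in E} \exp(-G_{t,j}/c)} \quad \text{for each } i \in E,
\]
that is, FTRL with the entropic regularizer on the simplex is exactly the exponentially weighted average (often called Hedge) strategy with step size $1/c$.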

Note now that, since the dual norm of $\|\cdot\|_1$ is $\|\cdot\|_\infty$, by H\"older's inequality we have, for every $y \in [-1,1]^E$ and every $u, v \in \mathbb{R}^E$,
\[
|y^{\mathrm{T}} u - y^{\mathrm{T}} v| = |y^{\mathrm{T}}(u - v)| \le \|y\|_\infty \|u - v\|_1 \le \|u - v\|_1.
\]
Thus, we conclude that every function in $\mathcal{F}$ is $1$-Lipschitz continuous w.r.t. $\|\cdot\|_1$ on $\mathbb{R}^E$. To conclude, let us show that

\[
\sup_{x, y \in \Delta_E}\bigl(R(x) - R(y)\bigr) \le \ln d. \tag{4.24}
\]

First, since $[\alpha > 0]\, \alpha \ln \alpha \le 0$ for every $\alpha \in [0,1]$, we have $\sup_{x \in \Delta_E} R(x) \le 0$. Thus, we need only show that $\inf_{y \in \Delta_E} R(y)$ is attained by $d^{-1}\mathbb{1}$. Indeed, note that for every $x \in \Delta_E$ we have
\[
-\nabla R(d^{-1}\mathbb{1})^{\mathrm{T}}(x - d^{-1}\mathbb{1})
= -\Bigl(\mathbb{1} + \sum_{i \in E} e_i \ln d^{-1}\Bigr)^{\mathrm{T}}(x - d^{-1}\mathbb{1})
= -\bigl(\mathbb{1} - (\ln d)\mathbb{1}\bigr)^{\mathrm{T}}(x - d^{-1}\mathbb{1})
= -(1 - \ln d)(1 - 1) = 0.
\]
That is, $-\nabla R(d^{-1}\mathbb{1}) \in N_{\Delta_E}(d^{-1}\mathbb{1})$. By the optimality conditions from Theorem 3.6.2 we conclude that $\inf_{y \in \Delta_E} R(y) = R(d^{-1}\mathbb{1}) = -\ln d$, which proves (4.24). Since
\[
R' = \sqrt{\frac{T}{2\ln d}}\, R = \frac{\rho\sqrt{T}}{\sqrt{2\theta}}\, R,
\]
where $\rho := 1$ and $\theta := \ln d$, by Corollary 4.5.3 we have, for every enemy oracle $\mathrm{ENEMY}$ for $C$,
\[
\mathrm{Regret}_T(\mathrm{FTRL}_{R'}, \mathrm{ENEMY}, \Delta_E) \le \sqrt{2(\ln d)T}.
\]
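The guarantee of Proposition 4.5.5 is easy to probe numerically. The sketch below is not from the text: it simulates the experts' problem with uniformly random loss vectors in $[-1,1]^E$ (the random enemy, the function name, and the parameter values are our own choices), runs $\mathrm{FTRL}_{R'}$ using the closed form for the entropic iterate noted after (4.23), and compares the realized regret with the bound $\sqrt{2(\ln d)T}$. Since the bound holds for every enemy oracle, the first printed value should not exceed the second.
\begin{verbatim}
import numpy as np

def entropic_ftrl_regret(Y, c):
    """Play FTRL with regularizer c * sum_i p_i ln p_i (plus the indicator of
    the simplex) against the loss vectors Y[0], ..., Y[T-1].  The round-t
    iterate has the closed form p proportional to exp(-G/c), where G is the
    sum of the loss vectors revealed before round t."""
    T, d = Y.shape
    G = np.zeros(d)                  # accumulated loss vector
    player_loss = 0.0
    for t in range(T):
        z = -G / c
        z -= z.max()                 # stabilize the exponentials
        p = np.exp(z)
        p /= p.sum()                 # iterate x_t, a point of the simplex
        player_loss += float(Y[t] @ p)
        G += Y[t]
    return player_loss - G.min()     # regret against the best expert

rng = np.random.default_rng(0)
d, T = 50, 10_000
Y = rng.uniform(-1.0, 1.0, size=(T, d))   # enemy: i.i.d. uniform losses
c = np.sqrt(T / (2.0 * np.log(d)))        # scaling from Proposition 4.5.5
print(entropic_ftrl_regret(Y, c), np.sqrt(2.0 * np.log(d) * T))
\end{verbatim}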

It is interesting to try to understand the intuition behind the difference between the regret bounds given by the entropic regularizer (which is strongly convex w.r.t. the $\ell_1$-norm) and the squared $\ell_2$ regularizer on the prediction with expert advice problem with $d$ experts. Notice that in the case of $\ell_2$ regularization, even though the ``diameter'' of the set where the player is making her choices (i.e., the simplex in this case) is less than $1/2$, the functions the enemy can pick behave badly under the lens of the $\ell_2$-norm: the functions played by the enemy on the experts' problem are $\sqrt{d}$-Lipschitz continuous w.r.t. $\|\cdot\|_2$. In the case of the entropic regularizer, the functions behave way better w.r.t. the $\ell_1$-norm: they are $1$-Lipschitz continuous on the simplex w.r.t. the $\ell_1$-norm. However, this improvement on the Lipschitz constant is not for free: the diameter of the simplex through the lens of the entropic regularizer is $\ln d$, no longer a constant, though still small when compared to $d$. Still, in this case the trade-off is quite advantageous. Thus, when looking for FTRL regularizers $R$ for an OCO instance $C := (X, \mathcal{F})$, the intuition is that one should balance two factors.

The first is the diameter of $X$ through the lens of $R$, that is, any two points inside $X$ should not have values of $R$ which are too far apart. At the same time, the regularizer is usually associated with a norm $\|\cdot\|$ with respect to which $R$ is strongly convex. In this case, one wants the functions played by the enemy to be ``well-behaved'' under $\|\cdot\|$, that is, to have a small Lipschitz constant w.r.t. $\|\cdot\|$. To study the Lipschitz constant of the functions from $\mathcal{F}$, it is usually useful to look at the dual norms of the subgradients since, for any $x, y \in \mathbb{E}$ and any convex function $f \colon \mathbb{E} \to (-\infty, +\infty]$ which is subdifferentiable at $x$, the subgradient inequality yields, for any $g \in \partial f(x)$,
\[
f(x) - f(y) \le \langle g, x - y \rangle \le \|g\|_* \|x - y\|.
\]
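To make this trade-off concrete with a number that is not in the text: for $d = 1024$ experts, the $\ell_2$ route of Proposition 4.5.4 pays $\rho\sqrt{2\theta} = \sqrt{d}\cdot\sqrt{2\cdot\tfrac12} = 32$ per $\sqrt{T}$, while the entropic route of Proposition 4.5.5 pays $\rho\sqrt{2\theta} = 1\cdot\sqrt{2\ln 1024} \approx 3.7$ per $\sqrt{T}$ (with $\sigma = 1$ in both cases), which is the exponential improvement in the dependence on $d$ mentioned earlier.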