
Regularization Regardless of the Number of Rounds

that the functions behave much better w.r.t. the $\ell_1$ norm: they are 1-Lipschitz continuous on the simplex w.r.t. the $\ell_1$-norm. However, this improvement on the Lipschitz constant is not free: the diameter of the simplex through the lens of the entropic regularizer is $\ln d$, no longer a constant when compared to $d$. Still, in this case the trade-off is quite advantageous. Thus, when looking for FTRL regularizers $R$ for an OCO instance $C := (X, \mathcal{F})$, the intuition is that one should balance two factors.

The first is the diameter of $X$ through the lens of $R$, that is, any two points inside $X$ should not have values of $R$ which are too far apart. At the same time, this regularizer is usually associated with a norm $\lVert\cdot\rVert$ with respect to which $R$ is strongly convex. In this case, one wants the functions played by the enemy to be “well-behaved” under $\lVert\cdot\rVert$, that is, to have a small Lipschitz constant w.r.t. $\lVert\cdot\rVert$. To study the Lipschitz constant of the functions from $\mathcal{F}$, it is usually useful to look at the dual norms of the subgradients since, for any $x, y \in E$ and any convex function $f\colon E \to (-\infty,+\infty]$ which is subdifferentiable at $x$, the subgradient inequality yields, for any $g \in \partial f(x)$,
\[
f(x) - f(y) \le \langle g, x - y\rangle \le \lVert g\rVert_* \lVert x - y\rVert.
\]
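To make these two quantities concrete, the snippet below numerically checks that the negative-entropy regularizer has range $\ln d$ over the simplex, and that the subgradient inequality holds with the $\ell_\infty$ norm (the dual of $\ell_1$) for a linear loss. This is only an illustrative sketch; the function and variable names are ours, not from the text.

```python
import math

def neg_entropy(x):
    # Negative entropy R(x) = sum_i x_i ln x_i (with the convention 0 ln 0 = 0).
    return sum(xi * math.log(xi) for xi in x if xi > 0)

d = 8
# Range of R over the simplex: minimized at the uniform point, maximized (= 0)
# at any vertex, so the "diameter through the lens of R" is ln d.
uniform = [1.0 / d] * d
vertex = [1.0] + [0.0] * (d - 1)
diameter = neg_entropy(vertex) - neg_entropy(uniform)
assert abs(diameter - math.log(d)) < 1e-12

# Subgradient inequality for a linear loss f(x) = <g, x>:
# f(x) - f(y) = <g, x - y> <= ||g||_inf * ||x - y||_1.
g = [0.3, -0.7, 0.1, 0.5, -0.2, 0.0, 0.9, -0.4]
x, y = uniform, vertex
lhs = sum(gi * (xi - yi) for gi, xi, yi in zip(g, x, y))
rhs = max(abs(gi) for gi in g) * sum(abs(xi - yi) for xi, yi in zip(x, y))
assert lhs <= rhs + 1e-12
```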

Soon we will prove that $\text{DOUBLING}^{\text{PFAMILY}}$ performs well in games with any number of rounds.

Let us look more closely at Algorithm 4.4. The idea of the algorithm is to divide the rounds of the game into intervals, each with double the size of the previous one. The sizes of these intervals are, in some sense, estimates of the duration of the game. With that, the DOUBLING oracle starts to play with a player oracle for a small number of rounds (the number of rounds of the first interval). If the game goes beyond the timeframe given by the interval of rounds the DOUBLING oracle defined last, the oracle doubles its estimate of the number of rounds and starts to play from scratch, using a brand new player oracle. That is, the oracle does not use the information of the functions given by the enemy in the past round intervals.

The reason not to use the oracle with full information is that, in order to use the player oracle for $T$ functions $f_1, \dots, f_T$, one usually needs to compute the point given by the player oracle for $f_1, \dots, f_t$ for each $t \in [T]$. In an OCO game this is natural since the oracle receives only one new function per round. However, the DOUBLING oracle picks a brand new player oracle on each different section, and it would be inefficient¹⁰ to compute all the points this new player oracle would have played in previous rounds in order to use this new oracle with complete information in the next rounds. In Algorithm 4.4, the number $T_0$ is the last round on which the DOUBLING oracle re-started, doubling its estimate of the number of rounds. A nice property of this strategy is that $T_0$ in Algorithm 4.4 is also the size of the current section of rounds which the DOUBLING oracle is considering.
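The interval scheme just described can be sketched as follows. This is an informal illustration, not the text's Algorithm 4.4: `make_player` plays the role of PFAMILY, and the follow-the-leader player used below is a placeholder of our own.

```python
def doubling_play(make_player, losses):
    """Doubling trick: run a fresh player oracle on round intervals of
    sizes 1, 2, 4, ..., discarding all past information at each restart."""
    plays = []
    t, size = 0, 1  # size is the current estimate of the number of rounds
    while t < len(losses):
        player = make_player(size)  # brand new player oracle for this interval
        seen = []                   # this oracle only sees its own interval
        for _ in range(size):
            if t == len(losses):
                break
            plays.append(player(seen))  # play using the functions seen so far
            seen.append(losses[t])
            t += 1
        size *= 2                   # double the estimate and restart
    return plays

# Hypothetical stand-in player: for scalar losses f_t(x) = (x - c_t)^2,
# follow-the-leader plays the mean of the targets seen so far (0 before any).
def make_player(_horizon):
    return lambda cs: sum(cs) / len(cs) if cs else 0.0

xs = doubling_play(make_player, [1.0, 2.0, 3.0, 4.0, 5.0])
# Intervals are {1}, {2,3}, {4,5,...}; the player restarts from 0.0 at each.
```

Note how each restart throws away the targets from earlier intervals, which is exactly the inefficiency-motivated design choice discussed above.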

Maybe surprisingly, if the original time-dependent player oracle has an $O(\sqrt{T})$ regret bound in a game with $T$ rounds, the next theorem shows a regret bound for the DOUBLING oracle which is worse by only a constant factor.

Theorem 4.6.1. Let $C := (X, \mathcal{F})$ be an OCO instance and let $\text{PFAMILY}\colon \mathbb{N} \to E^{\operatorname{Seq}(\mathcal{F})}$ be a function such that $\text{PFAMILY}_T$ is a player oracle for $C$ for each $T \in \mathbb{N}$. If there are $U \subseteq E$ and $\alpha \in \mathbb{R}_+$ such that, for every $T \in \mathbb{N}$ and every enemy oracle ENEMY for $C$,
\[
\operatorname{Regret}_T(\text{PFAMILY}_T, \text{ENEMY}, U) \le \alpha\sqrt{T},
\]
then, for every $T \in \mathbb{N}$ and every enemy oracle ENEMY for $C$ we have
\[
\operatorname{Regret}_T(\text{DOUBLING}^{\text{PFAMILY}}, \text{ENEMY}, U) \le \left(\frac{\sqrt{2}}{\sqrt{2}-1}\right)\alpha\sqrt{T}.
\]

Proof. Let ENEMY be an enemy oracle for $C$. Moreover, let $T \in \mathbb{N}$ and define

\[
(\boldsymbol{x}, \boldsymbol{f}) := \operatorname{OCO}_C(\text{DOUBLING}^{\text{PFAMILY}}, \text{ENEMY}, T).
\]

Note that $T \le 2^{\lfloor \lg T\rfloor + 1} - 1$. Set $T' := 2^{\lfloor \lg T\rfloor + 1} - 1$ and define¹¹ $\boldsymbol{f}' \in (\mathcal{F} \cup \{0\})^{T'}$ by $f'_i := [i \le T]\,f_i$ for each $i \in [T']$. In words, $\boldsymbol{f}'$ is just $\boldsymbol{f}$ extended with zeroes. In this case, note that
\[
\operatorname{Regret}(\text{DOUBLING}^{\text{PFAMILY}}, \boldsymbol{f}, u) \le \operatorname{Regret}(\text{DOUBLING}^{\text{PFAMILY}}, \boldsymbol{f}', u), \quad \forall u \in E.
\]
Thus, we may assume without loss of generality that $T = 2^{\lfloor \lg T\rfloor + 1} - 1$.

¹⁰Inefficient here refers to practical implementations. All oracles in this text are defined in such a way that they “re-compute” all the previous iterates at every round. Still, these oracles often need little effort to generate one iterate given the past ones. This would not be the case for the DOUBLING player if it had to re-compute, even in practice, all the past iterates once it changes its player oracle.

¹¹Here we are using $0$ to denote the identically zero function on $E$.

To ease the notation, define $p(n) := 2^n$ for every $n \in \mathbb{N}$. Recall that, by our notation definition, $\boldsymbol{f}_{i:j} = \langle f_i, f_{i+1}, \dots, f_j\rangle$. Then, by the definition of the DOUBLING oracle, for any $u \in U$,

\[
\begin{aligned}
\operatorname{Regret}(\text{DOUBLING}^{\text{PFAMILY}}, \boldsymbol{f}, u)
&= \sum_{i=0}^{\lfloor \lg T\rfloor} \operatorname{Regret}_{p(i)}\bigl(\text{PFAMILY}_{p(i)}, \boldsymbol{f}_{p(i):p(i+1)-1}, u\bigr)\\
&\le \alpha \sum_{i=0}^{\lfloor \lg T\rfloor} \sqrt{p(i)}
= \alpha \sum_{i=0}^{\lfloor \lg T\rfloor} \sqrt{2^i}
= \alpha \left(\frac{\sqrt{2^{\lfloor \lg T\rfloor + 1}} - 1}{\sqrt{2} - 1}\right)\\
&\le \alpha \left(\frac{\sqrt{2T} - 1}{\sqrt{2} - 1}\right)
\le \alpha\sqrt{T} \left(\frac{\sqrt{2}}{\sqrt{2} - 1}\right).
\end{aligned}
\]
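The chain of inequalities above can be sanity-checked numerically: taking $\alpha = 1$, the sum of the per-interval bounds $\sqrt{2^i}$ never exceeds $\bigl(\sqrt{2}/(\sqrt{2}-1)\bigr)\sqrt{T} \approx 3.42\sqrt{T}$. This is a quick check of our own, not part of the proof.

```python
import math

const = math.sqrt(2) / (math.sqrt(2) - 1)  # the factor from Theorem 4.6.1
for T in range(1, 5000):
    # Sum of per-interval regret bounds sqrt(2^i), i = 0, ..., floor(lg T),
    # with alpha = 1.
    total = sum(math.sqrt(2 ** i) for i in range(int(math.log2(T)) + 1))
    assert total <= const * math.sqrt(T)
```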

Even though the Doubling Trick guarantees regret bounds only a multiplicative constant worse than the bound from Corollary 4.5.3, re-starting the algorithm several times seems wasteful. What we can do instead is to use the AdaFTRL algorithm with a regularizer strategy that always uses the same regularizer function, but with a different constant multiplying it at every round. That is, we are still using a static function as our main regularizer, but at each round we adjust the constant multiplying it so that it takes the duration of the game into account without the need of re-starting the whole algorithm. Before jumping into Corollary 4.6.3, we need to prove a simple lemma.

Lemma 4.6.2. Let $a_1, \dots, a_n \in \mathbb{R}_+$ with $a_1 > 0$. Then,
\[
\sum_{i=1}^{n} \frac{a_i}{\sqrt{\sum_{j=1}^{i} a_j}} \le 2\sqrt{\sum_{i=1}^{n} a_i}.
\]

Proof. The proof is by induction on $n$. The statement holds trivially for $n = 1$. Let $n > 1$, and define $s := \sum_{i=1}^{n} a_i$. By the induction hypothesis,
\[
\sum_{i=1}^{n} \frac{a_i}{\sqrt{\sum_{j=1}^{i} a_j}}
\le 2\sqrt{\sum_{i=1}^{n-1} a_i} + \frac{a_n}{\sqrt{\sum_{j=1}^{n} a_j}}
= 2\sqrt{s - a_n} + \frac{a_n}{\sqrt{s}}.
\]
Finally, note that
\[
\begin{aligned}
2\sqrt{s - a_n} + \frac{a_n}{\sqrt{s}} \le 2\sqrt{s}
&\iff 2\sqrt{s(s - a_n)} \le 2s - a_n
\iff 4s(s - a_n) \le (2s - a_n)^2\\
&\iff 4s^2 - 4sa_n \le 4s^2 - 4sa_n + a_n^2
\iff 0 \le a_n^2.
\end{aligned}
\]
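Lemma 4.6.2 is also easy to check numerically; the sketch below tests it on random nonnegative sequences (the test values are arbitrary, chosen by us).

```python
import math
import random

random.seed(0)
for _ in range(200):
    n = random.randint(1, 50)
    a = [random.random() for _ in range(n)]
    a[0] += 1e-9  # the lemma requires a_1 > 0
    prefix, lhs = 0.0, 0.0
    for ai in a:
        prefix += ai                       # running sum a_1 + ... + a_i
        lhs += ai / math.sqrt(prefix)      # a_i / sqrt(sum_{j<=i} a_j)
    assert lhs <= 2 * math.sqrt(sum(a)) + 1e-9
```

With $a_i = 1$ for every $i$, the lemma recovers the familiar bound $\sum_{t=1}^{n} 1/\sqrt{t} \le 2\sqrt{n}$, which is exactly how it is used below.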

Corollary 4.6.3 (Derived from Theorem 4.4.3). Let $C := (X, \mathcal{F})$ be an OCO instance such that each $f \in \mathcal{F}$ is proper and closed. Let $R\colon E \to (-\infty,+\infty]$ be a 1-strong FTRL regularizer for $C$ w.r.t. a norm $\lVert\cdot\rVert$ on $E$, and suppose $\mu R$ is also a classical FTRL regularizer for $C$ for any $\mu \in \mathbb{R}_{++}$. Let $\eta\colon \mathbb{N} \setminus \{0\} \to \mathbb{R}_{++}$ and define the regularizer strategy $\mathcal{R}\colon \operatorname{Seq}(\mathcal{F}) \to (-\infty,+\infty]^E$ by, for each $t \in \mathbb{N}$,
\[
\mathcal{R}(\boldsymbol{f}) := \left(\frac{1}{\eta_{t+1}} - [t > 0]\frac{1}{\eta_t}\right) R, \quad \forall \boldsymbol{f} \in \mathcal{F}^t.
\]
Let $T \in \mathbb{N}$ and let ENEMY be an enemy oracle for $C$. Define
\[
(\boldsymbol{x}, \boldsymbol{f}) := \operatorname{OCO}_C(\text{AdaFTRL}_{\mathcal{R}}, \text{ENEMY}, T).
\]
Finally, let $g_t \in \partial f_t(x_t)$ for each $t \in [T]$ and define $\sigma \in \mathbb{R}^T$ by $\sigma_t := \eta_t^{-1}$ for each $t \in [T]$. Then, $\mathcal{R}$ is an FTRL regularizer strategy for $C$ which is $\sigma$-strong for $\boldsymbol{f}$ w.r.t. $\lVert\cdot\rVert$, $\boldsymbol{x} \in \operatorname{Seq}(X)$ and, for every $u \in X$,
\[
\operatorname{Regret}(\text{AdaFTRL}_{\mathcal{R}}, \boldsymbol{f}, u) \le \sum_{t=1}^{T} \left(\frac{1}{\eta_t} - [t > 1]\frac{1}{\eta_{t-1}}\right)\bigl(R(u) - R(x_t)\bigr) + \frac{1}{2}\sum_{t=1}^{T} \eta_t \lVert g_t\rVert_*^2. \tag{4.25}
\]
In particular, consider the case where every function in $\mathcal{F}$ is $\rho$-Lipschitz continuous w.r.t. $\lVert\cdot\rVert$ on a convex set $D \supseteq X$ with nonempty interior and there is $\theta \in \mathbb{R}_{++}$ such that $\theta \ge \sup\{R(x) - R(y) : x \in X,\ y \in X \cap \operatorname{dom} R\}$. If we define
\[
\eta_t := \frac{1}{\rho}\sqrt{\frac{\theta}{t}}, \quad \forall t \in \mathbb{N} \setminus \{0\}, \tag{4.26}
\]
then
\[
\operatorname{Regret}(\text{AdaFTRL}_{\mathcal{R}}, \boldsymbol{f}, X) \le 2\rho\sqrt{\theta T}.
\]

Proof. Define $r_t := \mathcal{R}(\langle f_1, \dots, f_{t-1}\rangle)$ for each $t \in [T]$. Note that, for each $t \in [T]$, the function
\[
\sum_{i=1}^{t} r_i + \sum_{i=1}^{t} f_i = \frac{1}{\eta_t} R + \sum_{i=1}^{t} f_i
\]
is $(1/\eta_t)$-strongly convex w.r.t. $\lVert\cdot\rVert$. Therefore, $\mathcal{R}$ is a $\sigma$-strong FTRL regularizer strategy for $\boldsymbol{f}$ (the other necessary properties are easily implied by the fact that $\mu R$ is a classical FTRL regularizer for any $\mu \in \mathbb{R}_{++}$). Hence, (4.25) and $\boldsymbol{x} \in \operatorname{Seq}(X)$ follow directly from Theorem 4.4.3.

Suppose that every $f \in \mathcal{F}$ is $\rho$-Lipschitz continuous w.r.t. $\lVert\cdot\rVert$ on a convex set $D \supseteq X$ with nonempty interior, that there is $\theta \in \mathbb{R}_{++}$ as in the statement, and that $\eta$ is given by (4.26). By Theorem 3.8.4, for each $t \in [T]$ there is $g_t \in \partial f_t(x_t)$ such that $\lVert g_t\rVert_* \le \rho$. Therefore, by (4.25), for every $u \in X$ we have

\[
\begin{aligned}
\operatorname{Regret}(\text{AdaFTRL}_{\mathcal{R}}, \boldsymbol{f}, u)
&\le \theta \sum_{t=1}^{T} \left(\frac{1}{\eta_t} - [t > 1]\frac{1}{\eta_{t-1}}\right) + \frac{\rho^2}{2}\sum_{t=1}^{T} \eta_t\\
&= \frac{\theta}{\eta_T} + \frac{\rho^2}{2}\sum_{t=1}^{T} \eta_t
= \rho\sqrt{\theta T} + \frac{\rho\sqrt{\theta}}{2}\sum_{t=1}^{T} \frac{1}{\sqrt{t}}
\le 2\rho\sqrt{\theta T},
\end{aligned}
\]
where in the last inequality we have used Lemma 4.6.2.
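As a final sanity check of the step sizes (4.26), the snippet below plugs them into the right-hand side of (4.25) with the worst-case values $R(u) - R(x_t) = \theta$ and $\lVert g_t\rVert_* = \rho$, and confirms the bound $2\rho\sqrt{\theta T}$. The constants are arbitrary test values of our own.

```python
import math

rho, theta = 2.5, 4.0  # arbitrary Lipschitz constant and diameter bound
for T in (1, 10, 100, 1000):
    eta = [math.sqrt(theta / t) / rho for t in range(1, T + 1)]
    first = theta / eta[-1]             # first sum of (4.25) telescopes to theta/eta_T
    second = 0.5 * rho ** 2 * sum(eta)  # second sum of (4.25) with ||g_t||_* <= rho
    assert first + second <= 2 * rho * math.sqrt(theta * T) + 1e-9
```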

The above regret bound is a $\sqrt{2}$ multiplicative factor worse than the bound given by Corollary 4.5.3. Yet, this bound holds at every round of the game, without the need of any prior knowledge of the number of rounds. Still, note that we need to know the Lipschitz constant of the functions in order to use the above regularizer strategies.

Even though we know the Lipschitz constant in some important examples, such as in the experts problem, there are many cases where we do not have this information. Not only that, but even in the cases in which we do know the Lipschitz constant, the enemy may pick many functions that have subgradients with small dual norm, far from the upper bound given by the Lipschitz constant. We may hope that, if the regularizer strategy could “notice” and adapt to subgradients with small norm, the algorithm would perform better in these “easy” cases. In Chapter 6 we will investigate this idea.