HAL Id: hal-00670514

https://hal.archives-ouvertes.fr/hal-00670514v1

Preprint submitted on 15 Feb 2012 (v1), last revised 27 Sep 2012 (v2)

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


A new look at shifting regret

Nicolò Cesa-Bianchi, Pierre Gaillard, Gabor Lugosi, Gilles Stoltz

To cite this version:

Nicolò Cesa-Bianchi, Pierre Gaillard, Gabor Lugosi, Gilles Stoltz. A new look at shifting regret. 2012.

hal-00670514v1


A New Look at Shifting Regret

Nicolò Cesa-Bianchi NICOLO.CESA-BIANCHI@UNIMI.IT

DSI, Università degli Studi di Milano

Pierre Gaillard PIERRE.GAILLARD@ENS.FR

Ecole Normale Supérieure, Paris

Gábor Lugosi GABOR.LUGOSI@UPF.EDU

ICREA & Universitat Pompeu Fabra

Gilles Stoltz GILLES.STOLTZ@ENS.FR

Ecole Normale Supérieure, Paris & HEC Paris, Jouy-en-Josas, France

Abstract

We investigate extensions of well-known online learning algorithms such as the fixed-share algorithm of Herbster and Warmuth (1998) or the methods proposed by Bousquet and Warmuth (2002). These algorithms use weight sharing schemes to perform as well as the best sequence of experts with a limited number of changes. Here we show, with a common, general, and simpler analysis, that weight sharing in fact achieves much more than what it was designed for. We use it to simultaneously prove new shifting regret bounds for online convex optimization on the simplex in terms of the total variation distance, as well as new bounds for the related setting of adaptive regret. Finally, we exhibit the first logarithmic shifting bounds for exp-concave loss functions on the simplex.

Keywords: Prediction with expert advice, online convex optimization, tracking the best expert, shifting experts.

1. Introduction

Online convex optimization is a sequential prediction paradigm in which, at each time step, the learner chooses an element from a fixed convex set S and then is given access to a convex loss function defined on the same set. The value of the function on the chosen element is the learner's loss. Many problems such as prediction with expert advice, sequential investment, and online regression/classification can be viewed as special cases of this general framework. Online learning algorithms are designed to minimize the regret. The standard notion of regret is the difference between the learner's cumulative loss and the cumulative loss of the single best element in S. A much harder criterion to minimize is shifting regret, which is defined as the difference between the learner's cumulative loss and the cumulative loss of an arbitrary sequence of elements in S. Shifting regret bounds are typically expressed in terms of the shift, a notion of regularity measuring the length of the trajectory in S described by the comparison sequence (i.e., the sequence of elements against which the regret is evaluated).

In online convex optimization, shifting regret bounds for convex subsets $S \subseteq \mathbb{R}^d$ are obtained for the online mirror descent (or follow-the-regularized-leader) algorithm. In this case the shift is typically computed in terms of the p-norm of the difference of consecutive elements in the comparison sequence; see Herbster and Warmuth (2001) and Cesa-Bianchi and Lugosi (2006). In this paper we focus on the important special case when S is the simplex, and investigate online mirror descent with entropic regularizers. This family includes popular algorithms such as the exponentially weighted average (EWA) forecaster, Winnow, and exponentiated gradient. Proving general shifting bounds in this case is difficult due to the behavior of the regularizer at the boundary of the simplex.

. CNRS – Ecole normale supérieure, Paris – INRIA, within the project-team CLASSIC

Herbster and Warmuth (2001) show shifting bounds for mirror descent with entropic regularizers, using a 1-norm to measure the shift. In order to keep mirror descent from choosing points too close to the simplex boundary, they use a complex dynamic projection technique. When the comparison sequence is restricted to the corners of the simplex (which is the setting of prediction with expert advice), the shift is naturally defined to be the number of times the trajectory moves to a different corner. This problem is often called "tracking the best expert"; see, e.g., Herbster and Warmuth (1998); Vovk (1999); Herbster and Warmuth (2001); Bousquet and Warmuth (2002); György et al. (2005). It is well known that EWA with weight sharing, which corresponds to the fixed-share algorithm of Herbster and Warmuth (1998), achieves a good shifting bound in this setting.

Bousquet and Warmuth (2002) introduce a generalization of the fixed-share algorithm and prove various shifting bounds for any trajectory in the simplex. However, their bounds are expressed using a quantity that corresponds to a proper shift only for trajectories on the simplex corners.

Our analysis unifies, generalizes, and simplifies the previously quite different proof techniques and algorithms used in Herbster and Warmuth (1998) and Bousquet and Warmuth (2002). Our bounds are expressed in terms of a notion of shift based on the total variation distance. The generalization of the "small expert set" result in Bousquet and Warmuth (2002) leads us to obtain better bounds when the sequence against which the regret is measured is sparse. When the trajectory is restricted to the corners of the simplex, we recover, and occasionally improve, the known shifting bounds for prediction with expert advice. Besides, our analysis also captures the setting of adaptive regret, a related notion of regret introduced by Hazan and Seshadhri (2009). Shifting regret and adaptive regret were known to have some connections; this connection is now seen to be even tighter, as both regrets can be viewed as instances of the same alma mater regret, which we minimize. Finally, we also show how to dynamically tune the parameters of our algorithms, and briefly review the special case of exp-concave loss functions, exhibiting the first logarithmic shifting bounds for exp-concave loss functions on the simplex.

2. Preliminaries

We first define the sequential learning framework we work with. Even though our results hold in the general setting of online convex optimization, we present them in the somewhat simpler online linear optimization setup. We point out in Section 6 how these results may be generalized. Online linear optimization may be cast as a repeated game between the forecaster and the environment as follows. We use $\Delta_d$ to denote the simplex $\{ q \in [0,1]^d \,:\, \|q\|_1 = 1 \}$.

Online linear optimization. For each round $t = 1, \dots, T$:

1. Forecaster chooses $\widehat{p}_t = (\widehat{p}_{1,t}, \dots, \widehat{p}_{d,t}) \in \Delta_d$;
2. Environment chooses a loss vector $\ell_t = (\ell_{1,t}, \dots, \ell_{d,t}) \in [0,1]^d$;
3. Forecaster suffers loss $\widehat{p}_t^\top \ell_t$.


The goal of the forecaster is to minimize the accumulated loss $\widehat{L}_T = \sum_{t=1}^T \widehat{p}_t^\top \ell_t$. In the now classical problem of prediction with expert advice, the goal of the forecaster is to compete with the best fixed component (often called "expert") chosen in hindsight, that is, with $\min_{i=1,\dots,d} \sum_{t=1}^T \ell_{i,t}$. The focus of this paper is on more ambitious forecasters that compete with a richer class of sequences of components. Let $[d] = \{1, \dots, d\}$. We use $i_1^T = (i_1, \dots, i_T)$ to denote a sequence in $[d]^T$ and let $L_T(i_1^T) = \sum_{t=1}^T \ell_{i_t,t}$ be the cumulative linear loss of the sequence $i_1^T \in [d]^T$.

We start by introducing our main algorithmic tool, a generalized share algorithm. It is parametrized by the "mixing functions" $\psi_t \colon [0,1]^{(t+1)d} \to \Delta_d$ for $t = 1, \dots, T$ that assign probabilities to past "pre-weights" as defined below. In all examples discussed in this paper, these mixing functions are quite simple, but working with such a general model makes the main ideas more transparent. We then provide a simple lemma that serves as the starting point for analyzing different instances of the generalized share algorithm.

Algorithm 1: The generalized share algorithm.

Parameters: learning rate $\eta > 0$ and mixing functions $\psi_t$ for $t = 1, \dots, T$
Initialization: $\widehat{p}_1 = v_1 = (1/d, \dots, 1/d)$

For each round $t = 1, \dots, T$:

1. Predict $\widehat{p}_t$;
2. Observe loss $\ell_t \in [0,1]^d$;
3. [loss update] For each $j = 1, \dots, d$ define
$$v_{j,t+1} = \frac{\widehat{p}_{j,t}\, e^{-\eta \ell_{j,t}}}{\sum_{i=1}^d \widehat{p}_{i,t}\, e^{-\eta \ell_{i,t}}}$$
(the current pre-weights), and let $V_{t+1} = \big[ v_{i,s} \big]_{i \in [d],\, 1 \le s \le t+1}$ be the $d \times (t+1)$ matrix of all past and current pre-weights;
4. [shared update] Define $\widehat{p}_{t+1} = \psi_{t+1}\big(V_{t+1}\big)$.
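As a concrete illustration (our sketch, not part of the paper), the loop above can be written in a few lines of Python/NumPy; the mixing function $\psi$ is passed in as a callback acting on the pre-weight matrix:

```python
import numpy as np

def generalized_share(losses, eta, mixing_fn):
    """Sketch of Algorithm 1. `losses` is a (T, d) array of loss vectors
    in [0, 1]; `mixing_fn` maps the d x (t+1) matrix of pre-weights to a
    probability vector in the simplex (the shared update)."""
    T, d = losses.shape
    p = np.full(d, 1.0 / d)            # initial prediction p_hat_1
    columns = [np.full(d, 1.0 / d)]    # v_1, v_2, ... as columns of V
    cumulative_loss = 0.0
    for t in range(T):
        cumulative_loss += p @ losses[t]          # forecaster's loss
        w = p * np.exp(-eta * losses[t])          # loss update
        columns.append(w / w.sum())               # current pre-weights
        p = mixing_fn(np.column_stack(columns))   # shared update
    return cumulative_loss, p
```

Choosing `mixing_fn` to return the last column of the matrix recovers plain exponentially weighted averaging; the updates of Sections 3.1 and 3.2 correspond to other choices.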

Lemma 1. For all $t \ge 1$ and for all $q_t \in \Delta_d$, Algorithm 1 satisfies
$$\big( \widehat{p}_t - q_t \big)^\top \ell_t \le \frac{1}{\eta} \sum_{i=1}^d q_{i,t} \ln \frac{v_{i,t+1}}{\widehat{p}_{i,t}} + \frac{\eta}{8}\,.$$

Proof. By Hoeffding's inequality,
$$\sum_{j=1}^d \widehat{p}_{j,t}\, \ell_{j,t} \le -\frac{1}{\eta} \ln \Bigg( \sum_{j=1}^d \widehat{p}_{j,t}\, e^{-\eta \ell_{j,t}} \Bigg) + \frac{\eta}{8}\,. \tag{1}$$
By definition of $v_{i,t+1}$, for all $i = 1, \dots, d$ we then have
$$\sum_{j=1}^d \widehat{p}_{j,t}\, e^{-\eta \ell_{j,t}} = \frac{\widehat{p}_{i,t}\, e^{-\eta \ell_{i,t}}}{v_{i,t+1}}\,,$$
which entails
$$\widehat{p}_t^\top \ell_t \le \ell_{i,t} + \frac{1}{\eta} \ln \frac{v_{i,t+1}}{\widehat{p}_{i,t}} + \frac{\eta}{8}\,.$$
The proof is concluded by taking a convex aggregation with respect to $q_t$.


3. Shifting bounds

In this section we prove shifting regret bounds for the generalized share algorithm. We compare the cumulative loss $\sum_{t=1}^T \widehat{p}_t^\top \ell_t$ of the forecaster with the loss of an arbitrary sequence of vectors $q_1, \dots, q_T$ in the simplex $\Delta_d$, that is, with $\sum_{t=1}^T q_t^\top \ell_t$. The bounds we obtain depend, of course, on the "regularity" of the comparison sequence. In the now classical results on tracking the best expert (as in Herbster and Warmuth 1998; Vovk 1999; Herbster and Warmuth 2001; Bousquet and Warmuth 2002), this regularity is measured as the number of times $q_t \ne q_{t+1}$ (henceforth referred to as "hard shifts"). The main results of this paper show not only that these results may be generalized to obtain bounds in terms of "softer" regularity measures, but also that the same algorithms that were proposed with hard shift tracking in mind achieve such, perhaps surprisingly good, performance. Building on the general formulation introduced in Section 2, we derive such regret bounds for the fixed-share algorithm of Herbster and Warmuth (1998) and for the algorithms of Bousquet and Warmuth (2002).

In fact, it is advantageous to extend our analysis so that we compare the performance of the forecaster not only with sequences $q_1, \dots, q_T$ taking values in the simplex $\Delta_d$ of probability distributions, but rather against arbitrary sequences $u_1, \dots, u_T \in \mathbb{R}_+^d$ of vectors with non-negative components. The loss of such a sequence is defined by $\sum_{t=1}^T u_t^\top \ell_t$. For a fair comparison, we measure the cumulative loss of the forecaster by $\sum_{t=1}^T \|u_t\|_1\, \widehat{p}_t^\top \ell_t$. Of course, when $u_t \in \Delta_d$, we recover the original notion of regret.

The norms $\|u_1\|_1, \dots, \|u_T\|_1$ may be viewed as a sequence of weights that give more or less importance to the instantaneous loss suffered at each step. Of particular interest is the case when $\|u_t\|_1 \in [0,1]$, which is the setting of "time selection functions" (see Blum and Mansour 2007, Section 6). In particular, considering sequences with $\|u_t\|_1 \in \{0,1\}$ that include the zero vector will provide us a simple way of deriving "adaptive" regret bounds, a notion introduced by Hazan and Seshadhri (2009).

The first regret bounds derived below measure the regularity of the sequence $u_1^T = (u_1, \dots, u_T)$ in terms of the quantity
$$m(u_1^T) = \sum_{t=1}^{T-1} D_{\mathrm{TV}}(u_{t+1}, u_t) \tag{2}$$
where for $x = (x_1, \dots, x_d)$, $y = (y_1, \dots, y_d) \in \mathbb{R}_+^d$, we define $D_{\mathrm{TV}}(x, y) = \sum_{i \,:\, x_i > y_i} (x_i - y_i)$. Note that when $x, y \in \Delta_d$, we recover the total variation distance $D_{\mathrm{TV}}(x, y) = \frac{1}{2} \|x - y\|_1$, while for general $x, y \in \mathbb{R}_+^d$, the quantity $D_{\mathrm{TV}}(x, y)$ is not necessarily symmetric and is always bounded by $\|x - y\|_1$. Note also that when the vectors $u_t$ are incidence vectors $(0, \dots, 0, 1, 0, \dots, 0) \in \mathbb{R}^d$ of elements $i_t \in [d]$, then $m(u_1^T)$ corresponds to the number of shifts of the sequence $i_1^T \in [d]^T$, and we recover from the results stated below the classical bounds for tracking the best expert.
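For intuition, both the asymmetric distance $D_{\mathrm{TV}}$ and the shift $m(u_1^T)$ are easy to compute; a small sketch of ours:

```python
import numpy as np

def d_tv(x, y):
    """D_TV(x, y): sum of (x_i - y_i) over coordinates where x_i > y_i."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return diff[diff > 0].sum()

def shift(u):
    """m(u_1^T): total-variation shift of a sequence given as a (T, d) array."""
    u = np.asarray(u, dtype=float)
    return sum(d_tv(u[t + 1], u[t]) for t in range(len(u) - 1))
```

On the simplex, `d_tv(x, y)` equals half the 1-norm distance; on incidence vectors, `shift` counts the hard shifts, as stated above.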

3.1. Fixed-share update

We now analyze a specific instance of the generalized share algorithm corresponding to the update
$$\widehat{p}_{j,t+1} = \sum_{i=1}^d \Big( \frac{\alpha}{d} + (1-\alpha)\, \mathbb{1}_{\{i=j\}} \Big)\, v_{i,t+1} = \frac{\alpha}{d} + (1-\alpha)\, v_{j,t+1}\,, \qquad 0 \le \alpha \le 1. \tag{3}$$
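In code, one round of this instance (our sketch) simply mixes the pre-weights with the uniform distribution:

```python
import numpy as np

def fixed_share_step(p, loss, eta, alpha):
    """One round of fixed share: loss update to pre-weights v,
    then the shared update (3), mixing v with the uniform distribution."""
    w = p * np.exp(-eta * np.asarray(loss, dtype=float))
    v = w / w.sum()                       # pre-weights v_{t+1}
    return alpha / len(p) + (1 - alpha) * v
```

With $\alpha = 0$ this is plain exponentially weighted averaging; any $\alpha > 0$ keeps every weight above $\alpha/d$, which is what makes tracking possible.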


Despite seemingly different statements, this update in Algorithm 1 can be seen to lead exactly to the fixed-share algorithm of Herbster and Warmuth (1998) for prediction with expert advice.

Proposition 2. With the above update, for all $T \ge 1$, for all sequences $\ell_1, \dots, \ell_T$ of loss vectors $\ell_t \in [0,1]^d$, and for all $u_1, \dots, u_T \in \mathbb{R}_+^d$,
$$\sum_{t=1}^T \|u_t\|_1\, \widehat{p}_t^\top \ell_t - \sum_{t=1}^T u_t^\top \ell_t \le \frac{\|u_1\|_1 \ln d}{\eta} + \frac{\eta}{8} \sum_{t=1}^T \|u_t\|_1 + \frac{m(u_1^T)}{\eta} \ln \frac{d}{\alpha} + \frac{\sum_{t=1}^T \|u_t\|_1 - m(u_1^T) - 1}{\eta} \ln \frac{1}{1-\alpha}\,.$$

We emphasize that the fixed-share forecaster does not need to "know" anything about the sequence of the norms $\|u_t\|_1$. Of course, in order to minimize the obtained upper bound, the tuning parameters $\alpha, \eta$ need to be optimized, and their values will depend on the maximal value of $m(u_1^T)$ for the sequences one wishes to compete against. In particular, we obtain the following corollary, in which $h(x) = -x \ln x - (1-x) \ln(1-x)$ denotes the binary entropy function for $x \in [0,1]$. We recall¹ that $h(x) \le x \ln(e/x)$ for $x \in [0,1]$.

Corollary 3. Suppose Algorithm 1 is run with the update (3). Let $m_0 > 0$. For all $T \ge 1$, for all sequences $\ell_1, \dots, \ell_T$ of loss vectors $\ell_t \in [0,1]^d$, and for all $q_1, \dots, q_T \in \Delta_d$ with $m(q_1^T) \le m_0$,
$$\sum_{t=1}^T \widehat{p}_t^\top \ell_t - \sum_{t=1}^T q_t^\top \ell_t \le \sqrt{ \frac{T}{2} \bigg( (m_0 + 1) \ln d + (T-1)\, h\Big( \frac{m_0}{T-1} \Big) \bigg) }\,,$$
whenever $\eta$ and $\alpha$ are optimally chosen in terms of $m_0$ and $T$.

If we only consider vectors of the form $q_t = (0, \dots, 0, 1, 0, \dots, 0)$, then $m(q_1^T)$ corresponds to the number of times $q_{t+1} \ne q_t$ in the sequence $q_1^T$. We thus recover Herbster and Warmuth (1998, Theorem 1) and Bousquet and Warmuth (2002, Lemma 6) from the much more general Proposition 2.

Proof of Proposition 2. Applying Lemma 1 with $q_t = u_t / \|u_t\|_1$, and multiplying by $\|u_t\|_1$, we get for all $t \ge 1$ and $u_t \in \mathbb{R}_+^d$,
$$\|u_t\|_1\, \widehat{p}_t^\top \ell_t - u_t^\top \ell_t \le \frac{1}{\eta} \sum_{i=1}^d u_{i,t} \ln \frac{v_{i,t+1}}{\widehat{p}_{i,t}} + \frac{\eta}{8}\, \|u_t\|_1\,. \tag{4}$$
We now examine
$$\sum_{i=1}^d u_{i,t} \ln \frac{v_{i,t+1}}{\widehat{p}_{i,t}} = \sum_{i=1}^d \bigg( u_{i,t} \ln \frac{1}{\widehat{p}_{i,t}} - u_{i,t-1} \ln \frac{1}{v_{i,t}} \bigg) + \sum_{i=1}^d \bigg( u_{i,t-1} \ln \frac{1}{v_{i,t}} - u_{i,t} \ln \frac{1}{v_{i,t+1}} \bigg)\,. \tag{5}$$
For the first term on the right-hand side, we have
$$\sum_{i=1}^d \bigg( u_{i,t} \ln \frac{1}{\widehat{p}_{i,t}} - u_{i,t-1} \ln \frac{1}{v_{i,t}} \bigg) = \sum_{i \,:\, u_{i,t} \ge u_{i,t-1}} \bigg( (u_{i,t} - u_{i,t-1}) \ln \frac{1}{\widehat{p}_{i,t}} + u_{i,t-1} \ln \frac{v_{i,t}}{\widehat{p}_{i,t}} \bigg) + \sum_{i \,:\, u_{i,t} < u_{i,t-1}} \bigg( \underbrace{(u_{i,t} - u_{i,t-1}) \ln \frac{1}{v_{i,t}}}_{\le 0} + u_{i,t} \ln \frac{v_{i,t}}{\widehat{p}_{i,t}} \bigg)\,. \tag{6}$$
In view of the update (3), we have $1/\widehat{p}_{i,t} \le d/\alpha$ and $v_{i,t}/\widehat{p}_{i,t} \le 1/(1-\alpha)$. Substituting in (6), we get
$$\sum_{i=1}^d \bigg( u_{i,t} \ln \frac{1}{\widehat{p}_{i,t}} - u_{i,t-1} \ln \frac{1}{v_{i,t}} \bigg) \le \sum_{i \,:\, u_{i,t} \ge u_{i,t-1}} (u_{i,t} - u_{i,t-1}) \ln \frac{d}{\alpha} + \Bigg( \sum_{i \,:\, u_{i,t} \ge u_{i,t-1}} u_{i,t-1} + \sum_{i \,:\, u_{i,t} < u_{i,t-1}} u_{i,t} \Bigg) \ln \frac{1}{1-\alpha}$$
$$= D_{\mathrm{TV}}(u_t, u_{t-1}) \ln \frac{d}{\alpha} + \underbrace{\Bigg( \sum_{i=1}^d u_{i,t} - \sum_{i \,:\, u_{i,t} \ge u_{i,t-1}} (u_{i,t} - u_{i,t-1}) \Bigg)}_{=\, \|u_t\|_1 - D_{\mathrm{TV}}(u_t, u_{t-1})} \ln \frac{1}{1-\alpha}\,.$$
The sum of the second term in (5) telescopes. Substituting the obtained bounds in the first sum of the right-hand side in (5), and summing over $t = 2, \dots, T$, leads to
$$\sum_{t=2}^T \sum_{i=1}^d u_{i,t} \ln \frac{v_{i,t+1}}{\widehat{p}_{i,t}} \le m(u_1^T) \ln \frac{d}{\alpha} + \Bigg( \sum_{t=2}^T \|u_t\|_1 - 1 - m(u_1^T) \Bigg) \ln \frac{1}{1-\alpha} + \sum_{i=1}^d u_{i,1} \ln \frac{1}{v_{i,2}} \underbrace{-\, \sum_{i=1}^d u_{i,T} \ln \frac{1}{v_{i,T+1}}}_{\le\, 0}\,.$$
We hence get from (4), which we use in particular for $t = 1$,
$$\sum_{t=1}^T \Big( \|u_t\|_1\, \widehat{p}_t^\top \ell_t - u_t^\top \ell_t \Big) \le \frac{1}{\eta} \sum_{i=1}^d u_{i,1} \ln \frac{1}{\widehat{p}_{i,1}} + \frac{\eta}{8} \sum_{t=1}^T \|u_t\|_1 + \frac{m(u_1^T)}{\eta} \ln \frac{d}{\alpha} + \frac{\sum_{t=1}^T \|u_t\|_1 - 1 - m(u_1^T)}{\eta} \ln \frac{1}{1-\alpha}\,.$$
Since $\widehat{p}_{i,1} = 1/d$, the first term on the right-hand side equals $\|u_1\|_1 (\ln d)/\eta$, which yields the claimed bound.

¹ As can be seen by noting that $\ln \frac{1}{1-x} \le \frac{x}{1-x}$.

3.2. Sparse sequences: Bousquet-Warmuth updates

Bousquet and Warmuth (2002) proposed forecasters that are able to efficiently compete with the best sequence of experts among all those sequences that only switch a bounded number of times and also take a small number of different values. Such "sparse" sequences of experts appear naturally in many applications. In this section we show that their algorithms in fact work very well in comparison with a much larger class of sequences $u_1, \dots, u_T$ that are "regular" (that is, $m(u_1^T)$, defined in (2), is small) and "sparse" in the sense that the quantity
$$n(u_1^T) = \sum_{i=1}^d \max_{t=1,\dots,T} u_{i,t}$$
is small. Note that when $q_t \in \Delta_d$ for all $t$, two interesting upper bounds can be provided. First, denoting the union of the supports of these convex combinations by $S \subseteq [d]$, we have $n(q_1^T) \le |S|$, the cardinality of $S$. Also,
$$n(q_1^T) \le \big| \{ q_t,\ t = 1, \dots, T \} \big|\,,$$
the cardinality of the pool of convex combinations. Thus, $n(u_1^T)$ generalizes the notion of sparsity of Bousquet and Warmuth (2002).

Here we consider a family of shared updates of the form
$$\widehat{p}_{j,t} = (1-\alpha)\, v_{j,t} + \alpha\, \frac{w_{j,t}}{Z_t}\,, \qquad 0 \le \alpha \le 1, \tag{7}$$
where the $w_{j,t}$ are nonnegative weights that may depend on past and current pre-weights, and $Z_t = \sum_{i=1}^d w_{i,t}$ is a normalization constant. Shared updates of this form were proposed by Bousquet and Warmuth (2002, Sections 3 and 5.2).

Apart from generalizing the regret bounds of Bousquet and Warmuth (2002), we believe that the analysis given below is significantly simpler and more transparent. We are also able to slightly improve their original bounds.

We focus on choices of the weights $w_{j,t}$ that satisfy the following conditions: there exists a constant $C \ge 1$ such that for all $j = 1, \dots, d$ and $t = 1, \dots, T$,
$$v_{j,t} \le w_{j,t} \le 1 \qquad \text{and} \qquad C\, w_{j,t+1} \ge w_{j,t}\,. \tag{8}$$

The next result improves on Proposition 2 when $T \ll d$ and $n(u_1^T) \ll m(u_1^T)$, that is, when the dimension (or number of experts) $d$ is large but the sequence $u_1^T$ is sparse.

Proposition 4. Suppose Algorithm 1 is run with the shared update (7) with weights satisfying the conditions (8). Then for all $T \ge 1$, for all sequences $\ell_1, \dots, \ell_T$ of loss vectors $\ell_t \in [0,1]^d$, and for all sequences $u_1, \dots, u_T \in \mathbb{R}_+^d$,
$$\sum_{t=1}^T \|u_t\|_1\, \widehat{p}_t^\top \ell_t - \sum_{t=1}^T u_t^\top \ell_t \le \frac{n(u_1^T) \ln d}{\eta} + \frac{n(u_1^T)\, T \ln C}{\eta} + \frac{\eta}{8} \sum_{t=1}^T \|u_t\|_1 + \frac{m(u_1^T)}{\eta} \ln \frac{\max_{t \le T} Z_t}{\alpha} + \frac{\sum_{t=1}^T \|u_t\|_1 - m(u_1^T) - 1}{\eta} \ln \frac{1}{1-\alpha}\,.$$

Proof. The beginning and the end of the proof are similar to those of Proposition 2, as they do not depend on the specific weight update. In particular, inequalities (4) and (5) remain the same. The proof is modified after (6), which this time we upper bound using the first condition in (8):
$$\sum_{i=1}^d \bigg( u_{i,t} \ln \frac{1}{\widehat{p}_{i,t}} - u_{i,t-1} \ln \frac{1}{v_{i,t}} \bigg) = \sum_{i \,:\, u_{i,t} \ge u_{i,t-1}} \bigg( (u_{i,t} - u_{i,t-1}) \ln \frac{1}{\widehat{p}_{i,t}} + u_{i,t-1} \ln \frac{v_{i,t}}{\widehat{p}_{i,t}} \bigg) + \sum_{i \,:\, u_{i,t} < u_{i,t-1}} \bigg( \underbrace{(u_{i,t} - u_{i,t-1})}_{\le 0}\, \underbrace{\ln \frac{1}{v_{i,t}}}_{\ge\, \ln(1/w_{i,t})} + u_{i,t} \ln \frac{v_{i,t}}{\widehat{p}_{i,t}} \bigg)\,. \tag{9}$$
By definition of the shared update (7), we have $1/\widehat{p}_{i,t} \le Z_t/(\alpha\, w_{i,t})$ and $v_{i,t}/\widehat{p}_{i,t} \le 1/(1-\alpha)$. We then upper bound the quantity at hand in (9) by
$$\sum_{i \,:\, u_{i,t} \ge u_{i,t-1}} (u_{i,t} - u_{i,t-1}) \ln \frac{Z_t}{\alpha\, w_{i,t}} + \Bigg( \sum_{i \,:\, u_{i,t} \ge u_{i,t-1}} u_{i,t-1} + \sum_{i \,:\, u_{i,t} < u_{i,t-1}} u_{i,t} \Bigg) \ln \frac{1}{1-\alpha} + \sum_{i \,:\, u_{i,t} < u_{i,t-1}} (u_{i,t} - u_{i,t-1}) \ln \frac{1}{w_{i,t}}$$
$$= D_{\mathrm{TV}}(u_t, u_{t-1}) \ln \frac{Z_t}{\alpha} + \big( \|u_t\|_1 - D_{\mathrm{TV}}(u_t, u_{t-1}) \big) \ln \frac{1}{1-\alpha} + \sum_{i=1}^d (u_{i,t} - u_{i,t-1}) \ln \frac{1}{w_{i,t}}\,.$$
Proceeding as in the end of the proof of Proposition 2, we then get the claimed bound, provided that we can show that
$$\sum_{t=2}^T \sum_{i=1}^d (u_{i,t} - u_{i,t-1}) \ln \frac{1}{w_{i,t}} \le n(u_1^T)\, \big( \ln d + T \ln C \big) - \|u_1\|_1 \ln d\,,$$
which we do next. Indeed, the left-hand side can be rewritten as
$$\sum_{t=2}^T \sum_{i=1}^d \bigg( u_{i,t} \ln \frac{1}{w_{i,t}} - u_{i,t} \ln \frac{1}{w_{i,t+1}} \bigg) + \sum_{t=2}^T \sum_{i=1}^d \bigg( u_{i,t} \ln \frac{1}{w_{i,t+1}} - u_{i,t-1} \ln \frac{1}{w_{i,t}} \bigg)$$
$$\le \sum_{t=2}^T \sum_{i=1}^d u_{i,t} \ln \bigg( C\, \frac{w_{i,t+1}}{w_{i,t}} \bigg) + \Bigg( \sum_{i=1}^d u_{i,T} \ln \frac{1}{w_{i,T+1}} - \sum_{i=1}^d u_{i,1} \ln \frac{1}{w_{i,2}} \Bigg)$$
$$\le \sum_{i=1}^d \Big( \max_{t=1,\dots,T} u_{i,t} \Big) \sum_{t=2}^T \ln \bigg( C\, \frac{w_{i,t+1}}{w_{i,t}} \bigg) + \sum_{i=1}^d \Big( \max_{t=1,\dots,T} u_{i,t} \Big) \ln \frac{1}{w_{i,T+1}} - \sum_{i=1}^d u_{i,1} \ln \frac{1}{w_{i,2}}$$
$$= \sum_{i=1}^d \Big( \max_{t=1,\dots,T} u_{i,t} \Big) \bigg( (T-1) \ln C + \ln \frac{1}{w_{i,2}} \bigg) - \sum_{i=1}^d u_{i,1} \ln \frac{1}{w_{i,2}}\,,$$
where we used $C \ge 1$ for the first inequality and the second condition in (8) for the second inequality. The proof is concluded by noting that (8) entails $w_{i,2} \ge (1/C)\, w_{i,1} \ge (1/C)\, v_{i,1} = 1/(dC)$ and that the coefficient $\max_{t=1,\dots,T} u_{i,t} - u_{i,1}$ in front of $\ln(1/w_{i,2})$ is nonnegative.

We now generalize Corollaries 8 and 9 of Bousquet and Warmuth (2002) by showing two specific instances of the generic update (7) that satisfy (8). The first update uses $w_{j,t} = \max_{s \le t} v_{j,s}$. Then (8) is satisfied with $C = 1$. Moreover, since a sum of maxima of nonnegative elements is smaller than the sum of the sums, $Z_t \le \min\{d, t\} \le T$. This immediately gives the following result.

Corollary 5. Suppose Algorithm 1 is run with the update (7) with $w_{j,t} = \max_{s \le t} v_{j,s}$. For all $T \ge 1$, for all sequences $\ell_1, \dots, \ell_T$ of loss vectors $\ell_t \in [0,1]^d$, and for all $q_1, \dots, q_T \in \Delta_d$,
$$\sum_{t=1}^T \widehat{p}_t^\top \ell_t - \sum_{t=1}^T q_t^\top \ell_t \le \frac{n(q_1^T) \ln d}{\eta} + \frac{\eta}{8} T + \frac{m(q_1^T)}{\eta} \ln \frac{T}{\alpha} + \frac{T - m(q_1^T) - 1}{\eta} \ln \frac{1}{1-\alpha}\,.$$
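One round with this particular choice of weights can be sketched as follows (our code; the running maxima of the pre-weights stand in for $w_{j,t} = \max_{s \le t} v_{j,s}$):

```python
import numpy as np

def max_past_step(v, running_max, alpha):
    """Shared update (7) with w_{j,t} = max_{s<=t} v_{j,s}: mix the current
    pre-weights v with the normalized running maxima of past pre-weights."""
    running_max = np.maximum(running_max, v)
    p = (1 - alpha) * v + alpha * running_max / running_max.sum()
    return p, running_max
```

An expert that was good at any point in the past keeps a large running maximum, which is what lets the forecaster return quickly to previously used experts.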


The second update we discuss uses $w_{j,t} = \max_{s \le t} e^{\gamma(s-t)}\, v_{j,s}$ in (7), for some $\gamma > 0$. Both conditions in (8) are satisfied with $C = e^{\gamma}$. One also has
$$Z_t \le d \qquad \text{and} \qquad Z_t \le \sum_{\tau \ge 0} e^{-\gamma \tau} = \frac{1}{1 - e^{-\gamma}} \le 1 + \frac{1}{\gamma}\,,$$
where the last inequality follows from $e^x \ge 1 + x$ for all real $x$. The bound of Proposition 4 then instantiates as
$$\frac{n(q_1^T) \ln d}{\eta} + \frac{n(q_1^T)\, T \gamma}{\eta} + \frac{\eta}{8} T + \frac{m(q_1^T)}{\eta} \ln \frac{\min\{d,\ 1 + 1/\gamma\}}{\alpha} + \frac{T - m(q_1^T) - 1}{\eta} \ln \frac{1}{1-\alpha}$$
when sequences $u_t = q_t \in \Delta_d$ are considered. This bound is best understood when $\gamma$ is tuned optimally based on $T$ and on two bounds $m_0$ and $n_0$ on the quantities $m(q_1^T)$ and $n(q_1^T)$. Indeed, by optimizing $n_0 T \gamma + m_0 \ln(1/\gamma)$, i.e., by choosing $\gamma = m_0/(n_0 T)$, one gets a bound that improves on that of the previous corollary:

Corollary 6. Let $m_0, n_0 > 0$. Suppose Algorithm 1 is run with the update $w_{j,t} = \max_{s \le t} e^{\gamma(s-t)}\, v_{j,s}$, where $\gamma = m_0/(n_0 T)$. For all $T \ge 1$, for all sequences $\ell_1, \dots, \ell_T$ of loss vectors $\ell_t \in [0,1]^d$, and for all $q_1, \dots, q_T \in \Delta_d$ such that $m(q_1^T) \le m_0$ and $n(q_1^T) \le n_0$, we have
$$\sum_{t=1}^T \widehat{p}_t^\top \ell_t - \sum_{t=1}^T q_t^\top \ell_t \le \frac{n_0 \ln d}{\eta} + \frac{m_0}{\eta} \bigg( 1 + \ln \min\Big\{ d,\ 1 + \frac{n_0 T}{m_0} \Big\} \bigg) + \frac{\eta}{8} T + \frac{m_0}{\eta} \ln \frac{1}{\alpha} + \frac{T - m_0 - 1}{\eta} \ln \frac{1}{1-\alpha}\,.$$

As the factors $e^{-\gamma t}$ cancel out in the numerator and denominator of the ratio in (7), there is a straightforward implementation of the algorithm (not requiring the knowledge of $T$) that needs to maintain only $d$ weights.
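The $O(d)$ implementation alluded to follows from the recursion $w_{j,t} = \max\big\{ v_{j,t},\ e^{-\gamma}\, w_{j,t-1} \big\}$; a sketch of ours:

```python
import numpy as np

def decayed_max_step(v, w_prev, alpha, gamma):
    """Shared update (7) with w_{j,t} = max_{s<=t} e^{gamma(s-t)} v_{j,s},
    maintained with only d numbers via w_t = max(v_t, e^{-gamma} * w_{t-1})."""
    w = np.maximum(v, np.exp(-gamma) * w_prev)
    p = (1 - alpha) * v + alpha * w / w.sum()
    return p, w
```

The recursion is exact: each past pre-weight is discounted by one more factor $e^{-\gamma}$ per round, so the maximum over the past can be carried forward one step at a time.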

In contrast, the corresponding algorithms of Bousquet and Warmuth (2002), which use the updates $\widehat{p}_{j,t} = (1-\alpha)\, v_{j,t} + \alpha\, S_t^{-1} \sum_{s \le t-1} (t-s)^{-1} v_{j,s}$ or $\widehat{p}_{j,t} = (1-\alpha)\, v_{j,t} + \alpha\, S_t^{-1} \max_{s \le t-1} (t-s)^{-1} v_{j,s}$, where the $S_t$ denote normalization factors, need to maintain $O(dT)$ weights with a naive implementation, and $O(d \ln T)$ weights with a more sophisticated one. In addition, the obtained bounds are slightly worse than the one stated above in Corollary 6, as an additional factor of $m_0 \ln(1 + \ln T)$ is present in Bousquet and Warmuth (2002, Corollary 9).

4. Adaptive regret

Next we show how the results of the previous section, e.g., Proposition 2, imply guarantees in terms of adaptive regret, a notion introduced by Hazan and Seshadhri (2009) as follows. For $\tau_0 \in \{1, \dots, T\}$, the $\tau_0$-adaptive regret of a forecaster is defined by
$$R_T^{\tau_0\text{-adapt}} = \max_{\substack{[r,s] \subseteq [1,T] \\ s+1-r \le \tau_0}} \Bigg\{ \sum_{t=r}^s \widehat{p}_t^\top \ell_t - \min_{q \in \Delta_d} \sum_{t=r}^s q^\top \ell_t \Bigg\}\,. \tag{10}$$
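For concreteness, the quantity (10) can be computed exactly from the realized losses; since the inner minimum over $q \in \Delta_d$ of a linear function is attained at a corner, it reduces to the best single expert on each interval (our sketch):

```python
import numpy as np

def adaptive_regret(forecaster_losses, expert_losses, tau0):
    """tau0-adaptive regret (10): worst regret over intervals [r, s] of
    length at most tau0. The minimum over q in the simplex of a linear
    loss is attained at a corner, i.e., at the best single expert."""
    T = len(forecaster_losses)
    cum_f = np.concatenate(([0.0], np.cumsum(forecaster_losses)))
    cum_e = np.vstack((np.zeros(expert_losses.shape[1]),
                       np.cumsum(expert_losses, axis=0)))
    worst = -np.inf
    for r in range(T):
        for s in range(r, min(r + tau0, T)):
            interval_f = cum_f[s + 1] - cum_f[r]
            interval_best = (cum_e[s + 1] - cum_e[r]).min()
            worst = max(worst, interval_f - interval_best)
    return worst
```

This brute-force evaluation is only a diagnostic tool; the point of this section is that a single forecaster can guarantee a small value of (10) for all intervals simultaneously.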

Adaptive regret is an alternative way to measure the performance of a forecaster against a changing environment. It is a straightforward observation that adaptive regret bounds also lead to shifting regret bounds (in terms of hard shifts). Here we show that these two notions of regret share an even tighter connection, as they can both be viewed as instances of the same alma mater bound, e.g., Proposition 2.

Hazan and Seshadhri (2009) essentially considered the case of online convex optimization with exp-concave loss functions (see Section 6 below). In the case of general convex functions, they also mentioned that the greedy projection forecaster of Zinkevich (2003), i.e., mirror descent with a quadratic regularizer, enjoys adaptive regret guarantees. This forecaster can be implemented on the simplex in time $O(d)$; see, e.g., Duchi et al. (2008). We now show that the simpler fixed-share algorithm has a similar adaptive regret bound.

Proposition 7. Suppose that Algorithm 1 is run with the shared update (3). Then for all $T \ge 1$, for all sequences $\ell_1, \dots, \ell_T$ of loss vectors $\ell_t \in [0,1]^d$, and for all $\tau_0 \in \{1, \dots, T\}$,
$$R_T^{\tau_0\text{-adapt}} \le \frac{1}{\eta} \ln \frac{d}{\alpha} + \frac{\tau_0 - 1}{\eta} \ln \frac{1}{1-\alpha} + \frac{\eta}{8} \tau_0\,.$$
In particular, when $\eta$ and $\alpha$ are chosen optimally (depending on $\tau_0$ and $T$),
$$R_T^{\tau_0\text{-adapt}} \le \sqrt{ \frac{\tau_0}{2} \bigg( \tau_0\, h\Big( \frac{1}{\tau_0} \Big) + \ln d \bigg) } \le \sqrt{ \frac{\tau_0}{2} \ln(e\, d\, \tau_0) }\,.$$

Proof. For $1 \le r \le s \le T$ and $q \in \Delta_d$, the regret in the right-hand side of (10) equals the regret considered in Proposition 2 against the sequence $u_1^T$ defined as $u_t = q$ for $t = r, \dots, s$ and $u_t = 0 = (0, \dots, 0)$ for the remaining $t$. When $r \ge 2$, this sequence is such that $D_{\mathrm{TV}}(u_r, u_{r-1}) = D_{\mathrm{TV}}(q, 0) = 1$ and $D_{\mathrm{TV}}(u_{s+1}, u_s) = D_{\mathrm{TV}}(0, q) = 0$, so that $m(u_1^T) = 1$, while $\|u_1\|_1 = 0$. When $r = 1$, we have $\|u_1\|_1 = 1$ and $m(u_1^T) = 0$. In all cases, $m(u_1^T) + \|u_1\|_1 = 1$. Specializing the bound of Proposition 2 to the sequence $u_1^T$ thus defined gives the result.

5. Online tuning of the parameters

The forecasters studied above need their parameters $\eta$ and $\alpha$ to be tuned according to various quantities, including the time horizon $T$. We show here how the trick of Auer et al. (2002) of having these parameters vary over time can be extended to our setting. For the sake of concreteness we focus on the fixed-share update, i.e., Algorithm 1 run with the update (3). We respectively replace steps 3 and 4 of its description by the loss and shared updates
$$v_{j,t+1} = \frac{\widehat{p}_{j,t}^{\,\eta_t/\eta_{t-1}}\, e^{-\eta_t \ell_{j,t}}}{\sum_{i=1}^d \widehat{p}_{i,t}^{\,\eta_t/\eta_{t-1}}\, e^{-\eta_t \ell_{i,t}}} \qquad \text{and} \qquad \widehat{p}_{j,t+1} = \frac{\alpha_t}{d} + (1 - \alpha_t)\, v_{j,t+1}\,, \tag{11}$$
for all $t \ge 1$ and all $j \in [d]$, where $(\eta_\tau)$ and $(\alpha_\tau)$ are two sequences of positive numbers, indexed by $\tau \ge 1$. We also conventionally define $\eta_0 = \eta_1$. Proposition 2 is then adapted in the following way (when $\eta_t \equiv \eta$ and $\alpha_t \equiv \alpha$, Proposition 2 is exactly recovered).
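One round of the time-varying update (11) can be sketched as follows (our code); note the previous weights are raised to the power $\eta_t/\eta_{t-1}$ before the exponential update:

```python
import numpy as np

def tuned_fixed_share_step(p, loss, eta_t, eta_prev, alpha_t):
    """Loss and shared updates (11) with time-varying learning rate
    eta_t <= eta_prev and mixing rate alpha_t."""
    w = p ** (eta_t / eta_prev) * np.exp(-eta_t * np.asarray(loss, dtype=float))
    v = w / w.sum()
    return alpha_t / len(p) + (1 - alpha_t) * v
```

With constant parameters this reduces exactly to the fixed-share step of Section 3.1, as the proposition below notes.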


Proposition 8. The forecaster based on the above updates (11) achieves the following performance bound whenever $\eta_t \le \eta_{t-1}$ and $\alpha_t \le \alpha_{t-1}$ for all $t \ge 1$. For all $T \ge 1$, for all sequences $\ell_1, \dots, \ell_T$ of loss vectors $\ell_t \in [0,1]^d$, and for all $u_1, \dots, u_T \in \mathbb{R}_+^d$,
$$\sum_{t=1}^T \|u_t\|_1\, \widehat{p}_t^\top \ell_t - \sum_{t=1}^T u_t^\top \ell_t \le \Bigg( \frac{\|u_1\|_1}{\eta_1} + \sum_{t=2}^T \|u_t\|_1 \Big( \frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} \Big) \Bigg) \ln d + \frac{m(u_1^T)}{\eta_T} \ln \frac{d(1-\alpha_T)}{\alpha_T} + \sum_{t=2}^T \frac{\|u_t\|_1}{\eta_{t-1}} \ln \frac{1}{1-\alpha_t} + \sum_{t=1}^T \frac{\eta_{t-1}}{8} \|u_t\|_1\,.$$

Due to space constraints, we only instantiate the obtained bound to the case of $T$-adaptive regret guarantees, when $T$ is unknown and/or can increase without bound.

Corollary 9. Consider the forecaster based on the above updates with $\eta_t = \sqrt{\ln(dt)/t}$ for $t \ge 3$ and $\eta_0 = \eta_1 = \eta_2 = \eta_3$ on the one hand, and $\alpha_t = 1/t$ on the other hand. For all $T \ge 3$ and for all sequences $\ell_1, \dots, \ell_T$ of loss vectors $\ell_t \in [0,1]^d$,
$$\max_{[r,s] \subseteq [1,T]} \Bigg\{ \sum_{t=r}^s \widehat{p}_t^\top \ell_t - \min_{q \in \Delta_d} \sum_{t=r}^s q^\top \ell_t \Bigg\} \le \sqrt{2T \ln(dT)} + \sqrt{3 \ln(3d)}\,.$$

Proof. The sequence $n \mapsto \ln(n)/n$ is non-increasing only after round $n \ge 3$, so that the defined sequences $(\alpha_t)$ and $(\eta_t)$ are non-increasing, as desired. For a given pair $(r, s)$ and a given $q \in \Delta_d$, we consider the sequence $u_1^T$ defined in the proof of Proposition 7; it satisfies $m(u_1^T) \le 1$ and $\|u_t\|_1 \le 1$ for all $t \ge 1$. Therefore, Proposition 8 ensures that
$$\sum_{t=r}^s \widehat{p}_t^\top \ell_t - \min_{q \in \Delta_d} \sum_{t=r}^s q^\top \ell_t \le \frac{\ln d}{\eta_T} + \frac{1}{\eta_T} \underbrace{\ln \frac{d(1-\alpha_T)}{\alpha_T}}_{\le\, \ln(dT)} + \underbrace{\sum_{t=2}^T \frac{1}{\eta_{t-1}} \ln \frac{1}{1-\alpha_t}}_{\le\, (1/\eta_T) \sum_{t=2}^T \ln\frac{t}{t-1}\, =\, (\ln T)/\eta_T} + \sum_{t=1}^T \frac{\eta_{t-1}}{8}\,.$$
It only remains to substitute the proposed values of $\eta_t$ and to note that
$$\sum_{t=1}^T \eta_{t-1} \le 3\eta_3 + \sum_{t=3}^{T-1} \frac{1}{\sqrt{t}} \sqrt{\ln(dT)} \le 3 \sqrt{\frac{\ln(3d)}{3}} + 2\sqrt{T} \sqrt{\ln(dT)}\,.$$

6. Online convex optimization and exp-concave loss functions

By using a standard reduction, the results of the previous sections can be applied to online convex optimization on the simplex. In this setting, at each step $t$ the forecaster chooses $\widehat{p}_t \in \Delta_d$ and is then given access to a convex loss $\ell_t \colon \Delta_d \to [0,1]$. Now, running Algorithm 1 with the loss vector given by a subgradient of $\ell_t$ at $\widehat{p}_t$, i.e., an element of $\partial \ell_t(\widehat{p}_t)$, leads to the desired bounds. Indeed, by the convexity of $\ell_t$, the regret at each time $t$ with respect to any vector $u_t \in \mathbb{R}_+^d$ with $\|u_t\|_1 > 0$ is then bounded as
$$\|u_t\|_1 \bigg( \ell_t(\widehat{p}_t) - \ell_t\Big( \frac{u_t}{\|u_t\|_1} \Big) \bigg) \le \big( \|u_t\|_1\, \widehat{p}_t - u_t \big)^\top \nabla \ell_t(\widehat{p}_t)\,.$$
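As an illustration of the reduction (our example, not from the paper), take online regression losses $\ell_t(p) = (p^\top x_t - y_t)^2$ with features $x_t \in [0,1]^d$ and targets $y_t \in [0,1]$, so the losses are convex and bounded; the fixed-share algorithm is simply fed the gradient of $\ell_t$ at $\widehat{p}_t$:

```python
import numpy as np

def fixed_share_on_gradients(xs, ys, eta, alpha):
    """Reduction to online linear optimization: at each round, feed the
    gradient of the convex loss l_t(p) = (p @ x_t - y_t)**2 at the current
    prediction to the fixed-share update. Gradients are bounded here
    because |x|, |y| <= 1, playing the role of the [0,1] loss range."""
    d = xs.shape[1]
    p = np.full(d, 1.0 / d)
    total = 0.0
    for x, y in zip(xs, ys):
        total += (p @ x - y) ** 2
        g = 2.0 * (p @ x - y) * x         # gradient of l_t at p
        w = p * np.exp(-eta * g)
        v = w / w.sum()
        p = alpha / d + (1 - alpha) * v
    return total, p
```

The linear regret bounds of Sections 3 to 5 then transfer to the convex losses, up to the rescaling of the gradient range.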


6.1. Exp-concave loss functions

Recall that a loss function $\ell_t$ is called $\eta_0$-exp-concave if $e^{-\eta_0 \ell_t}$ is concave. (In particular, exp-concavity implies convexity.) Bousquet and Warmuth (2002) study shifting regret for exp-concave loss functions. However, they define the regret of an element $q_1^T$ of the comparison class (a sequence of elements in $\Delta_d$) by
$$\sum_{t=1}^T \Big( \ell_t(\widehat{p}_t) - q_t^\top \ell_t \Big)\,, \tag{12}$$
where $\ell_t = \big( \ell_t(e_1), \dots, \ell_t(e_d) \big)$ and $e_1, \dots, e_d$ are the elements of the canonical basis of $\mathbb{R}^d$. This corresponds to the linear optimization case studied in the previous sections. However, due to exp-concavity, (1) can be replaced by an application of Jensen's inequality, namely,
$$\ell_t(\widehat{p}_t) \le -\frac{1}{\eta_0} \ln \Bigg( \sum_{j=1}^d \widehat{p}_{j,t}\, e^{-\eta_0 \ell_t(e_j)} \Bigg)\,.$$
Hence the various propositions and corollaries of Sections 3 and 4 still hold true for the regret (12) up to some modifications (deletion of the terms linear in $\eta$, assumption of exp-concavity, boundedness no longer needed). For the sake of concreteness, we illustrate the required modifications on Proposition 4.

Proposition 10. Suppose Algorithm 1 is run with the shared update (7) with weights satisfying the conditions (8), and with the choice $\eta = \eta_0$. Then for all $T \ge 1$, for all sequences $\ell_1, \dots, \ell_T$ of $\eta_0$-exp-concave loss functions, and for all sequences $u_1, \dots, u_T \in \mathbb{R}_+^d$,
$$\sum_{t=1}^T \|u_t\|_1\, \ell_t(\widehat{p}_t) - \sum_{t=1}^T u_t^\top \ell_t \le \frac{n(u_1^T) \ln d}{\eta_0} + \frac{n(u_1^T)\, T \ln C}{\eta_0} + \frac{m(u_1^T)}{\eta_0} \ln \frac{\max_{t \le T} Z_t}{\alpha} + \frac{\sum_{t=1}^T \|u_t\|_1 - m(u_1^T) - 1}{\eta_0} \ln \frac{1}{1-\alpha}\,.$$

We now turn to the more ambitious goal of controlling regrets of the form $\sum_{t=1}^T \big( \ell_t(\widehat{p}_t) - \ell_t(q_t) \big)$, where the losses $\ell_t$ are exp-concave. Hazan and Seshadhri (2009) constructed algorithms with $T$-adaptive regret of order $O(\ln^2 T)$ running in time $\mathrm{poly}(d, \log T)$. They also constructed different algorithms with $T$-adaptive regret bounded by $O(\ln T)$ and running time $\mathrm{poly}(d, T)$.

Next, we show the first logarithmic shifting bounds for exp-concave loss functions. However, we only do so against sequences $q_1^T$ of elements in $\Delta_d$; that is, we offer here no general bound in terms of vectors $u_1^T$ that would, here as well, unify the view between tracking bounds and adaptive regret bounds. Besides, we get shifting bounds only in terms of the hard shifts
$$s(q_1^T) = \big| \big\{ t = 2, \dots, T \,:\, q_t \ne q_{t-1} \big\} \big|\,.$$

Obviously, getting unifying bounds in terms of soft shifts of sequences $u_1^T$ of nonnegative vectors is an important open question, which we leave for future research. To get our bound, we mix ideas of Herbster and Warmuth (1998) and Blum and Kalai (1997). We define a prior over the sequences of convex weight vectors as the distribution of the following homogeneous Markov chain $Q_1, Q_2, \dots$: the starting vector $Q_1$ is drawn at random according to the uniform distribution $\mu$ over $\Delta_d$; then, given $Q_{t-1}$, the next element $Q_t$ is equal to $Q_{t-1}$ with probability $1-\alpha$, and with probability $\alpha$ is drawn at random according to $\mu$. In the sequel, all probabilities $\mathbb{P}$ and expectations $\mathbb{E}$ are with respect to this Markov chain. Now, the convex weight vector used at time $t \ge 1$ by the forecaster is
$$\widehat{p}_t = \frac{\mathbb{E}\Big[ Q_t\, e^{-\eta_0 L_{t-1}(Q_1^{t-1})} \Big]}{\mathbb{E}\Big[ e^{-\eta_0 L_{t-1}(Q_1^{t-1})} \Big]}\,, \qquad \text{where} \quad L_{t-1}\big(Q_1^{t-1}\big) = \sum_{s=1}^{t-1} \ell_s(Q_s) \tag{13}$$
(with the convention that an empty sum is null). For this forecaster, we get the following performance bound, whose proof can be found in the appendix.
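The expectations in (13) have no closed form in general; a Monte Carlo sketch of ours, with sampled chains standing in for the exact expectations, illustrates the forecaster:

```python
import numpy as np

def markov_prior_forecast(past_losses, eta0, alpha, d, n_samples=4000, seed=0):
    """Monte Carlo approximation of the forecaster (13): sample chains
    Q_1, Q_2, ... (uniform start on the simplex, restart w.p. alpha) and
    average Q_t weighted by exp(-eta0 * L_{t-1}(Q_1^{t-1}))."""
    rng = np.random.default_rng(seed)
    num, den = np.zeros(d), 0.0
    for _ in range(n_samples):
        q = rng.dirichlet(np.ones(d))      # Q_1 ~ uniform over Delta_d
        cum = 0.0
        for loss in past_losses:           # losses l_1, ..., l_{t-1}
            cum += loss(q)
            if rng.random() < alpha:       # switch with probability alpha
                q = rng.dirichlet(np.ones(d))
        weight = np.exp(-eta0 * cum)
        num += weight * q                  # q now plays the role of Q_t
        den += weight
    return num / den
```

With no past losses the forecast is the mean of the uniform prior, i.e., close to the uniform distribution; as losses accumulate, mass concentrates on sequences of weight vectors with small cumulative loss.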

Proposition 11. For all $T \ge 1$ and for all sequences $\ell_1, \dots, \ell_T$ of $\eta_0$-exp-concave loss functions taking values in $[0, L]$, the cumulative loss of the above forecaster is bounded, for all sequences $q_1, \dots, q_T \in \Delta_d$, by
$$\sum_{t=1}^T \ell_t(\widehat{p}_t) - \sum_{t=1}^T \ell_t(q_t) \le \frac{\big( s(q_1^T) + 1 \big)(d-1)}{\eta_0}\, \max\Bigg\{ 1,\ \ln \frac{e\, \eta_0\, L\, T}{\big( s(q_1^T) + 1 \big)(d-1)} \Bigg\} + \frac{s(q_1^T)}{\eta_0} \ln \frac{1}{\alpha} + \frac{T - s(q_1^T) - 1}{\eta_0} \ln \frac{1}{1-\alpha}\,.$$

Under the imposition of a bound $s_0$ on the number of hard shifts $s(q_1^T)$, and up to a tuning of $\alpha$ in terms of $s_0$ and $T$, the last two terms of the bound are smaller than $T\, h(s_0/T) \le s_0 \ln(e\,T/s_0)$, and therefore the whole regret bound is $O\big( (d\, s_0 / \eta_0) \ln T \big)$.


References

P. Auer, N. Cesa-Bianchi, and C. Gentile. Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, 64:48–75, 2002.

A. Blum and A. Kalai. Universal portfolios with and without transaction costs. In Proceedings of the 10th Annual Conference on Learning Theory (COLT), pages 309–313. Springer, 1997.

A. Blum and Y. Mansour. From external to internal regret. Journal of Machine Learning Research, 8:1307–1324, 2007.

O. Bousquet and M.K. Warmuth. Tracking a small set of experts by mixing past posteriors. Journal of Machine Learning Research, 3:363–396, 2002.

N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning (ICML), 2008.

A. György, T. Linder, and G. Lugosi. Tracking the best of many experts. In Proceedings of the 18th Annual Conference on Learning Theory (COLT), pages 204–216, Bertinoro, Italy, June 2005. Springer.

E. Hazan and C. Seshadhri. Efficient learning algorithms for changing environments. In Proceedings of the 26th International Conference on Machine Learning (ICML), 2009.

M. Herbster and M. Warmuth. Tracking the best expert. Machine Learning, 32:151–178, 1998.

M. Herbster and M. Warmuth. Tracking the best linear predictor. Journal of Machine Learning Research, 1:281–309, 2001.

V. Vovk. Derandomizing stochastic prediction strategies. Machine Learning, 35(3):247–282, June 1999.

M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML), 2003.


Appendix A. Proof of Proposition 8

We first adapt Lemma 1.

Lemma 12. The forecaster based on the loss and shared updates (11) satisfies, for all $t \ge 1$ and for all $q_t \in \Delta_d$,
$$\big( \widehat{p}_t - q_t \big)^\top \ell_t \le \sum_{i=1}^d q_{i,t} \bigg( \frac{1}{\eta_{t-1}} \ln \frac{1}{\widehat{p}_{i,t}} - \frac{1}{\eta_t} \ln \frac{1}{v_{i,t+1}} \bigg) + \Big( \frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} \Big) \ln d + \frac{\eta_{t-1}}{8}\,,$$
whenever $\eta_t \le \eta_{t-1}$.

Proof. By Hoeffding's inequality,
$$\sum_{j=1}^d \widehat{p}_{j,t}\, \ell_{j,t} \le -\frac{1}{\eta_{t-1}} \ln \Bigg( \sum_{j=1}^d \widehat{p}_{j,t}\, e^{-\eta_{t-1} \ell_{j,t}} \Bigg) + \frac{\eta_{t-1}}{8}\,.$$
By Jensen's inequality, since $\eta_t \le \eta_{t-1}$ and thus $x \mapsto x^{\eta_{t-1}/\eta_t}$ is convex,
$$\frac{1}{d} \sum_{j=1}^d \widehat{p}_{j,t}\, e^{-\eta_{t-1} \ell_{j,t}} = \frac{1}{d} \sum_{j=1}^d \Big( \widehat{p}_{j,t}^{\,\eta_t/\eta_{t-1}}\, e^{-\eta_t \ell_{j,t}} \Big)^{\eta_{t-1}/\eta_t} \ge \Bigg( \frac{1}{d} \sum_{j=1}^d \widehat{p}_{j,t}^{\,\eta_t/\eta_{t-1}}\, e^{-\eta_t \ell_{j,t}} \Bigg)^{\eta_{t-1}/\eta_t}\,.$$
Substituting in Hoeffding's bound, we get
$$\widehat{p}_t^\top \ell_t \le -\frac{1}{\eta_t} \ln \Bigg( \sum_{j=1}^d \widehat{p}_{j,t}^{\,\eta_t/\eta_{t-1}}\, e^{-\eta_t \ell_{j,t}} \Bigg) + \Big( \frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} \Big) \ln d + \frac{\eta_{t-1}}{8}\,.$$
Now, by definition of the loss update in (11), for all $i \in [d]$,
$$\sum_{j=1}^d \widehat{p}_{j,t}^{\,\eta_t/\eta_{t-1}}\, e^{-\eta_t \ell_{j,t}} = \frac{1}{v_{i,t+1}}\, \widehat{p}_{i,t}^{\,\eta_t/\eta_{t-1}}\, e^{-\eta_t \ell_{i,t}}\,,$$
which, after substitution in the previous bound, leads to the inequality
$$\widehat{p}_t^\top \ell_t \le \ell_{i,t} + \frac{1}{\eta_{t-1}} \ln \frac{1}{\widehat{p}_{i,t}} - \frac{1}{\eta_t} \ln \frac{1}{v_{i,t+1}} + \Big( \frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} \Big) \ln d + \frac{\eta_{t-1}}{8}\,,$$
valid for all $i \in [d]$. The proof is concluded by taking a convex aggregation over $i$ with respect to $q_t$.

The proof of Proposition 8 follows the steps of that of Proposition 2; we sketch it below.

Proof of Proposition 8. Applying Lemma 12 with $q_t = u_t/\|u_t\|_1$, and multiplying by $\|u_t\|_1$, we get for all $t \ge 1$ and $u_t \in \mathbb{R}_+^d$,
$$\|u_t\|_1\, \widehat{p}_t^\top \ell_t - u_t^\top \ell_t \le \frac{1}{\eta_{t-1}} \sum_{i=1}^d u_{i,t} \ln \frac{1}{\widehat{p}_{i,t}} - \frac{1}{\eta_t} \sum_{i=1}^d u_{i,t} \ln \frac{1}{v_{i,t+1}} + \|u_t\|_1 \Big( \frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} \Big) \ln d + \frac{\eta_{t-1}}{8}\, \|u_t\|_1\,. \tag{14}$$
We will sum these bounds over $t \ge 1$ to get the desired result, but first need to perform some additional bounding for $t \ge 2$; in particular, we examine
$$\frac{1}{\eta_{t-1}} \sum_{i=1}^d u_{i,t} \ln \frac{1}{\widehat{p}_{i,t}} - \frac{1}{\eta_t} \sum_{i=1}^d u_{i,t} \ln \frac{1}{v_{i,t+1}} = \frac{1}{\eta_{t-1}} \sum_{i=1}^d \bigg( u_{i,t} \ln \frac{1}{\widehat{p}_{i,t}} - u_{i,t-1} \ln \frac{1}{v_{i,t}} \bigg) + \sum_{i=1}^d \bigg( \frac{u_{i,t-1}}{\eta_{t-1}} \ln \frac{1}{v_{i,t}} - \frac{u_{i,t}}{\eta_t} \ln \frac{1}{v_{i,t+1}} \bigg)\,, \tag{15}$$
where the first difference on the right-hand side can be bounded as in (6) by
$$\sum_{i=1}^d \bigg( u_{i,t} \ln \frac{1}{\widehat{p}_{i,t}} - u_{i,t-1} \ln \frac{1}{v_{i,t}} \bigg) \le \sum_{i \,:\, u_{i,t} \ge u_{i,t-1}} \bigg( (u_{i,t} - u_{i,t-1}) \ln \frac{1}{\widehat{p}_{i,t}} + u_{i,t-1} \ln \frac{v_{i,t}}{\widehat{p}_{i,t}} \bigg) + \sum_{i \,:\, u_{i,t} < u_{i,t-1}} u_{i,t} \ln \frac{v_{i,t}}{\widehat{p}_{i,t}}$$
$$\le D_{\mathrm{TV}}(u_t, u_{t-1}) \ln \frac{d}{\alpha_t} + \big( \|u_t\|_1 - D_{\mathrm{TV}}(u_t, u_{t-1}) \big) \ln \frac{1}{1-\alpha_t} \le D_{\mathrm{TV}}(u_t, u_{t-1}) \ln \frac{d(1-\alpha_T)}{\alpha_T} + \|u_t\|_1 \ln \frac{1}{1-\alpha_t}\,, \tag{16}$$
where we used, for the second inequality, that the shared update in (11) is such that $1/\widehat{p}_{i,t} \le d/\alpha_t$ and $v_{i,t}/\widehat{p}_{i,t} \le 1/(1-\alpha_t)$, and, for the third inequality, that $\alpha_t \ge \alpha_T$ and that $x \mapsto (1-x)/x$ is decreasing on $(0,1]$. Summing (15) over $t = 2, \dots, T$, using (16) and the fact that $\eta_t \ge \eta_T$, we get
$$\sum_{t=2}^T \Bigg( \frac{1}{\eta_{t-1}} \sum_{i=1}^d u_{i,t} \ln \frac{1}{\widehat{p}_{i,t}} - \frac{1}{\eta_t} \sum_{i=1}^d u_{i,t} \ln \frac{1}{v_{i,t+1}} \Bigg) \le \frac{m(u_1^T)}{\eta_T} \ln \frac{d(1-\alpha_T)}{\alpha_T} + \sum_{t=2}^T \frac{\|u_t\|_1}{\eta_{t-1}} \ln \frac{1}{1-\alpha_t} + \sum_{i=1}^d \Bigg( \frac{u_{i,1}}{\eta_1} \ln \frac{1}{v_{i,2}} - \underbrace{\frac{u_{i,T}}{\eta_T} \ln \frac{1}{v_{i,T+1}}}_{\ge\, 0} \Bigg)\,.$$
An application of (14), including for $t = 1$ (for which we recall that $\widehat{p}_{i,1} = 1/d$ and $\eta_0 = \eta_1$ by convention), concludes the proof.
