
From Cover's impossibility result²⁴ (Proposition 2.3.3), we know that there is no deterministic player oracle which attains sublinear policy regret in general. However, even randomized player oracles have no hope of attaining sublinear (in the number of rounds) expected policy regret.

Theorem 2.7.1 (Based on [5, Theorem 1]). Let $P := (\{1,2\}^2, \{1,2\}, [-1,1]^2, L)$ be a prediction with expert advice problem, where²⁵ $L(d, y) := y_d$ for every $d \in \{1,2\}$ and $y \in Y$. Let PLAYER be a randomized player oracle for $P$ on a probability space $(\Omega, \Sigma, \mathbb{P})$. Then, there are nature and measurable enemy oracles NATURE and ENEMY for $P$, respectively, such that, for any $T \in \mathbb{N} \setminus \{0\}$ and for the hypothesis set $H := \{\, x \in \{1,2\}^2 \mapsto x_i : i \in \{1,2\} \,\}$, we have
\[
\max_{h \in H} \mathbb{E}[\mathrm{PRegret}_T(\mathrm{NATURE}, \mathrm{PLAYER}, \mathrm{ENEMY}, h)] \geq \frac{T-1}{2}.
\]

Proof. Let $y \in \{1,2\}$ be such that
\[
\mathbb{P}\big(\mathrm{PLAYER}(\langle (1,2)^{\mathsf{T}} \rangle, \langle\,\rangle) = y\big) \geq \frac{1}{2}.
\]

One exists since the player is bound to make a choice from a pool of two options. Moreover, define nature and enemy oracles NATURE and $\mathrm{ENEMY}_{\mathrm{PLAYER}}$ for $P$, respectively, by
\[
\mathrm{NATURE}(t) := (1,2)^{\mathsf{T}} \quad \text{for each } t \in \mathbb{N},
\]
\[
\mathrm{ENEMY}_{\mathrm{PLAYER}}(x, d) := [t > 1 \text{ and } d_1 = y]\,\mathbf{1} \quad \text{for each } t \in \mathbb{N},\ x \in (\{1,2\}^2)^t, \text{ and } d \in \{1,2\}^{t-1}.
\]
Note that for any $T \in \mathbb{N} \setminus \{0\}$ and any $x \in (A^E)^T$, we have that $\mathrm{ENEMY}_{\mathrm{PLAYER}}(x, \cdot)$ is a constant function on $A^{T-1}$ and, thus, measurable. Let $T \in \mathbb{N}$, and define
\[
(x, D, Y) := \mathrm{OCO}_P(\mathrm{NATURE}, \mathrm{PLAYER}, \mathrm{ENEMY}_{\mathrm{PLAYER}}, T).
\]

By the definition of the enemy oracle $\mathrm{ENEMY}_{\mathrm{PLAYER}}$, the player suffers a loss of $1$ at every round (except for the first) if she picks $y$ on the first round, and suffers a loss of $0$ otherwise. Thus,
\[
\mathbb{E}\Big[\sum_{t=1}^{T} L(D_t, Y_t)\Big] = (T-1)\,\mathbb{P}(D_1 = y) \geq \frac{T-1}{2}.
\]
Let $z \in \{1,2\} \setminus \{y\}$, set $h_z(x) := x_z$ for every $x \in \{1,2\}^2$, and define
\[
u := \langle h_z(x_1), h_z(x_2), \dots, h_z(x_T) \rangle.
\]

Then, $u_1 = h_z(x_1) = x_1(z) = z \neq y$. Hence, by definition,
\[
\mathrm{ENEMY}_{\mathrm{PLAYER}}(x_{1:t}, u_{1:t-1}) = 0 \text{ for each } t \in [T]
\implies
\sum_{t=1}^{T} L\big(h_z(x_t), \mathrm{ENEMY}_{\mathrm{PLAYER}}(x_{1:t}, u_{1:t-1})\big) = 0.
\]
Since $h_z \in H$, we are done.
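The construction in the proof can be checked empirically. The sketch below is a minimal Monte Carlo simulation under illustrative assumptions (a player that picks uniformly at random, so $y$ is chosen on the first round with probability exactly $1/2$); it is not the text's formal oracle machinery. The player's average cumulative loss is compared against the zero loss of the expert $h_z$.

```python
import random

def enemy(first_decision, y, t):
    """Loss vector in [-1,1]^2: the all-ones vector after round 1 iff the
    player's first decision was y, and the zero vector otherwise."""
    if t > 1 and first_decision == y:
        return (1.0, 1.0)
    return (0.0, 0.0)

def run(T, y, rng):
    # Uniform player: each round's decision is drawn uniformly from {1, 2}.
    decisions = [rng.choice((1, 2)) for _ in range(T)]
    total = 0.0
    for t in range(1, T + 1):
        y_t = enemy(decisions[0], y, t)
        total += y_t[decisions[t - 1] - 1]  # L(d, y) = y_d
    return total

rng = random.Random(0)
T, y, trials = 100, 1, 20_000
avg_player_loss = sum(run(T, y, rng) for _ in range(trials)) / trials

# Comparator: the expert h_z with z != y has u_1 = z != y, so the
# counterfactual enemy responses are all zero and its total loss is 0.
expert_loss = 0.0
policy_regret = avg_player_loss - expert_loss
print(policy_regret >= (T - 1) / 2 - 2)  # True, up to Monte Carlo error
```

The simulation makes the mechanism concrete: the enemy punishes the player forever based only on her first decision, while the fixed expert's counterfactual decision sequence never triggers the punishment.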

²⁴ Although Cover's result is about regret, the enemy built in the statement of the proposition is oblivious. Thus, one can check that regret and policy regret are equal for such an enemy, and the result holds for policy regret.

²⁵ Note that $L$ is measurable.

Even though the original result from [5] is slightly more general, this version of the theorem, which looks at a specific instance of the prediction with expert advice problem, can be easily compared to Cover's impossibility result (Proposition 2.3.3). Additionally, by Proposition 2.6.2 we know that the randomized experts problem can be reduced, in some sense, to an online convex optimization instance over the simplex. For this latter OCO problem, we are going to see that there are player oracles which attain regret sublinear in the number of rounds. This shows that policy regret is much harder to handle than the traditional notion of regret.

Policy regret is not the focus of our text, so we limit its discussion to this section. Still, we point the reader who is interested in policy regret minimization to [5] and [22]. Interestingly, in both of these papers the authors obtain more interesting results when the enemy is oblivious to "old" rounds, that is, when the enemy is a function only of the last $c$ rounds, where $c$ is some constant.

Finally, there are some other variations of regret which are useful for some problems and in some applications of online learning to other fields. For more information on different regret variations, see [24, Sections 4.4 and 4.6].

Chapter 3

Convex Analysis, Optimization, and Duality Theory

It is not surprising that the description and analysis of most algorithms for online convex optimization, which is the focus of the remainder of the text, rely heavily on ideas from convex analysis and optimization. Even though the early developments in OCO could be seen in a mostly self-contained way, the field has recently been experiencing more coherent and unified progress, mainly due to the use of powerful ideas from convex analysis and optimization, chiefly the ideas from convex duality theory based on the Hyperplane Separation Theorem and on Fenchel conjugates of functions. The presentation of the algorithms in the next chapters follows this latter trend, since it reveals interesting insights about the inner workings of, and connections among, OCO algorithms. However, simply requiring the reader to have a background in convex analysis would heavily restrict the accessibility of this text.

In this chapter we overview the main concepts from convex analysis which we use throughout the remainder of the text, with a focus on building intuition. The presentation aims to be of use both for readers who are proficient in convex analysis and for those who are not. For the former group, this chapter serves as a review of the main concepts from convex analysis we use throughout the text, together with proofs of some of the more specific results which are used in later chapters. For those who are having one of their first contacts with convex analysis through this text, we aim to build most of the intuition necessary to understand the descriptions and analyses of the main algorithms presented here. For the sake of conciseness and simplicity, we do not prove many of the results that we state in this chapter, although we leave references for the interested reader in such cases.

Recall from Section 1.1 that, throughout the whole text, $E$ denotes a Euclidean space (a finite-dimensional real vector space equipped with an inner product) whose inner product we denote by $\langle \cdot, \cdot \rangle$. Moreover, throughout the remainder of the text we equip $\mathbb{R}^d$ with the Euclidean inner product $(x, y) \in \mathbb{R}^d \times \mathbb{R}^d \mapsto x^{\mathsf{T}} y$, and we equip $\mathbb{S}^d$ with the trace inner product $(X, Y) \in \mathbb{S}^d \times \mathbb{S}^d \mapsto \mathrm{Tr}(XY)$.

Finally, this chapter is mainly based on [59], although it also draws from many other sources such as [15, 17, 18, 55].

3.1 Convex Sets and Functions

A set $C \subseteq E$ is convex if $\lambda x + (1-\lambda)y \in C$ for any $\lambda \in [0,1]$ and any $x, y \in C$, and $C$ is affine if $\lambda x + (1-\lambda)y \in C$ for any $\lambda \in \mathbb{R}$ and any $x, y \in C$. That is, a set $C \subseteq E$ is convex if and only if, for any $x, y \in C$, the line segment (between $x$ and $y$) given by $[x, y] := \{\lambda x + (1-\lambda)y : \lambda \in [0,1]\}$ is entirely contained in $C$. Moreover, a set $C \subseteq E$ is affine if the line that passes through any two distinct points $x, y \in C$ is contained in $C$. Note that intersections of convex (affine) sets are convex (affine).
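To make the segment condition concrete, here is a small numeric sanity check in Python (the disk and circle below are illustrative sets, not sets from the text): it tests whether $\lambda x + (1-\lambda)y$ stays in the set for sampled pairs of points and sampled $\lambda \in [0,1]$.

```python
import itertools
import math
import random

def is_convex_sample(member, points, lambdas):
    """Test the segment condition lambda*x + (1-lambda)*y in C on samples."""
    return all(
        member(tuple(l * a + (1 - l) * b for a, b in zip(x, y)))
        for x, y in itertools.combinations(points, 2)
        for l in lambdas
    )

disk = lambda p: p[0]**2 + p[1]**2 <= 1 + 1e-9           # closed unit disk
circle = lambda p: abs(p[0]**2 + p[1]**2 - 1) <= 1e-9    # unit circle only

rng = random.Random(1)
pts = [(math.cos(a), math.sin(a))
       for a in (rng.uniform(0, 2 * math.pi) for _ in range(20))]
lams = [i / 10 for i in range(11)]

print(is_convex_sample(disk, pts, lams))    # True: segments stay in the disk
print(is_convex_sample(circle, pts, lams))  # False: midpoints leave the circle
```

Of course, sampling can only refute convexity, never prove it; it is meant purely as a way to build intuition for the definition.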

Let us now define convex functions. We do so by looking at the set formed by the graph of the function, in some sense. This way of looking at functions is useful since results and concepts for convex sets can often be translated into analogous results about convex functions almost seamlessly.

Formally, let $f \colon S \to \mathbb{R}$, where $S \subseteq E$. The epigraph of $f$ is the set
\[
\mathrm{epi} f := \{\, x \oplus \mu \in S \oplus \mathbb{R} : f(x) \leq \mu \,\},
\]
and $f$ is convex if $\mathrm{epi} f$ is convex. That is, the epigraph of a function $f$ is the set built by taking the graph of $f$ and extruding it upwards. In Figure 3.1 we present a graphical representation of the epigraph of a two-dimensional function.

Figure 3.1: Illustration of the epigraph of a (non-convex) function f.

In this text we will follow the same convention used in [59]: all functions we deal with can be evaluated everywhere in $E$, even though they can take on infinite values. The arithmetic properties of $+\infty$ (which we often denote simply by $\infty$) and $-\infty$ that we use are the same as the ones used in [59]. Namely,
\[
\alpha + \infty = +\infty + \alpha = +\infty \quad \text{and} \quad \alpha - \infty = -\infty + \alpha = -\infty \quad \text{for all } \alpha \in \mathbb{R},
\]
\[
\alpha(+\infty) = (+\infty)\alpha = +\infty \quad \text{and} \quad \alpha(-\infty) = (-\infty)\alpha = -\infty \quad \text{for all } \alpha \in \mathbb{R}_{++},
\]
\[
\alpha(+\infty) = (+\infty)\alpha = -\infty \quad \text{and} \quad \alpha(-\infty) = (-\infty)\alpha = +\infty \quad \text{for all } \alpha \in -\mathbb{R}_{++},
\]
\[
0(+\infty) = (+\infty)0 = 0 \quad \text{and} \quad 0(-\infty) = (-\infty)0 = 0,
\]
\[
+\infty + \infty = +\infty \quad \text{and} \quad -\infty - \infty = -\infty,
\]
\[
\inf \emptyset = +\infty \quad \text{and} \quad \sup \emptyset = -\infty.
\]
We note that the expressions $+\infty - \infty$ and $-\infty + \infty$ are not defined and, thus, are carefully avoided.
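As an aside for readers who want to experiment numerically, these conventions mostly agree with IEEE-754 floating-point arithmetic, with one notable exception: the convention above sets $0 \cdot (+\infty) = 0$, while floating-point arithmetic yields NaN. A quick check in Python:

```python
import math

inf = math.inf
print(5.0 + inf)              # inf   (alpha + inf = +inf)
print(5.0 - inf)              # -inf  (alpha - inf = -inf)
print(2.0 * inf, -2.0 * inf)  # inf -inf, with signs as in the convention
print(math.isnan(0.0 * inf))  # True: IEEE disagrees with the convention here

# inf over the empty set is +inf, sup over the empty set is -inf:
print(min([], default=inf))   # inf
print(max([], default=-inf))  # -inf
```

So code that mimics extended-value convex functions with floats must handle the $0 \cdot \infty$ case explicitly rather than rely on the hardware arithmetic.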

We do not lose any generality with this assumption about the values functions can take, since a convex function $f$ defined only on a subset $S \subseteq E$ can be extended by setting $f(x) := +\infty$ for every $x \in E \setminus S$. This extension preserves the epigraph and, thus, convexity. The usefulness of this convention is that it makes many proofs and results less technical, requiring, in some cases, only some care with (non-)finiteness at some points. The (effective) domain of $f \colon E \to [-\infty, +\infty]$ is $\mathrm{dom} f := \{x \in E : f(x) \neq +\infty\}$ (see Figure 3.1 for an example of the effective domain of a function).

While $+\infty$ is used to indicate points outside the domain of $f$, functions which take the value $-\infty$ somewhere are, in some sense, pathological. Moreover, we want to avoid dealing with functions which are infinite everywhere. Thus, we will almost always deal with proper functions: a function $f \colon E \to [-\infty, +\infty]$ is proper if $\mathrm{dom} f$ is nonempty and $f(x) \neq -\infty$ for every $x \in E$, that is, if $\mathrm{epi} f$ is nonempty and contains no vertical lines.

Finally, one can prove that a function $f \colon E \to (-\infty, +\infty]$ is convex if and only if it satisfies the more familiar condition
\[
f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda) f(y), \qquad \forall x, y \in E,\ \forall \lambda \in [0,1].
\]

Even though it is intuitive, the above characterization of convex functions may be hard to use when showing convexity of a given function. The next lemma will be useful later on to show convexity of twice continuously differentiable functions.

Lemma 3.1.1 ([59, Theorem 4.5] or [15, Corollary 2.1]). Let $f \colon \mathbb{R}^d \to (-\infty, +\infty]$ be a proper function which is twice continuously differentiable on $\mathrm{dom} f$ and such that $\mathrm{dom} f$ is convex. Then $f$ is convex if and only if $\nabla^2 f(x) \succeq 0$ for any $x \in \mathrm{dom} f$.
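As a numeric illustration of the lemma (with hypothetical example functions, not ones from the text), one can estimate the Hessian by finite differences and inspect its smallest eigenvalue at sampled points of the domain:

```python
import numpy as np

def hessian(f, x, h=1e-4):
    """Central finite-difference estimate of the Hessian of f at x."""
    d = len(x)
    H = np.empty((d, d))
    for i in range(d):
        for j in range(d):
            e_i, e_j = np.eye(d)[i], np.eye(d)[j]
            H[i, j] = (f(x + h*e_i + h*e_j) - f(x + h*e_i - h*e_j)
                       - f(x - h*e_i + h*e_j) + f(x - h*e_i - h*e_j)) / (4*h*h)
    return H

def min_hessian_eig(f, points):
    """Smallest Hessian eigenvalue of f over the sampled points."""
    return min(np.linalg.eigvalsh(hessian(f, x)).min() for x in points)

rng = np.random.default_rng(0)
pts = rng.uniform(-2, 2, size=(50, 2))

convex = lambda x: x[0]**4 + x[0]**2 + x[1]**2  # Hessian diag(12x0^2+2, 2)
nonconvex = lambda x: x[0]**2 - x[1]**2         # saddle: eigenvalues 2, -2

print(min_hessian_eig(convex, pts) >= -1e-3)   # True: PSD everywhere sampled
print(min_hessian_eig(nonconvex, pts) < 0)     # True: a negative eigenvalue
```

A negative Hessian eigenvalue at even one point of a convex domain certifies non-convexity; the converse direction, as in the sampling check above, is only evidence, not proof.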

Some functions are very useful when stating or proving convex analysis results. For any set $C \subseteq E$, define

• the indicator function $\delta(\cdot \mid C)$ of $C$ by $\delta(x \mid C) := 0$ for every $x \in C$ and $\delta(x \mid C) := +\infty$ for every $x \in E \setminus C$, and

• the support function $\delta^*(\cdot \mid C)$ of $C$ by $\delta^*(x \mid C) := \sup\{\, \langle y, x \rangle : y \in C \,\}$.

The best way to build intuition about indicator functions is by looking at their epigraphs. For example, the epigraph of $\delta(\cdot \mid [-1,1])$ is simply the (infinite) rectangle in $\mathbb{R}^2$ formed by taking the segment from $(-1,0)$ to $(1,0)$ and extruding it upwards (i.e., in the direction $(0,1)$). More generally, if $C \subseteq E$, then the epigraph of $\delta(\cdot \mid C)$ is the set $C$ embedded in the hyperplane $E \oplus 0$ and extruded in the direction $0 \oplus 1 \in E \oplus \mathbb{R}$.

The intuition on the support function of a convex set $C \subseteq E$, on the other hand, is better pictured in a different way. Namely, let $a \in E$ be such that $\bar{\beta} := \delta^*(a \mid C)$ is finite. For every $\beta \in \mathbb{R}$ we have the hyperplane $H(\beta) := \{x \in E : \langle a, x \rangle = \beta\}$ and two associated (closed) half-spaces
\[
H_{\leq}(\beta) := \{x \in E : \langle a, x \rangle \leq \beta\} \quad \text{and} \quad H_{\geq}(\beta) := \{x \in E : \langle a, x \rangle \geq \beta\},
\]
which, in some sense, divide $E$ into two almost disjoint sets. In particular, $H(\bar{\beta})$ is also a hyperplane, and by the definition of the support function we have $\langle a, x \rangle \leq \delta^*(a \mid C) = \bar{\beta}$ for every $x \in C$, that is, we have $C \subseteq H_{\leq}(\bar{\beta})$. Thus, $H(\bar{\beta})$ is such that the set $C$ is entirely contained in one of its half-spaces. Not only that, by the definition of the support function we have $\bar{\beta} = \inf\{\beta \in \mathbb{R} : C \subseteq H_{\leq}(\beta)\}$. In words, $\bar{\beta}$ is the minimum value of $\beta \in \mathbb{R}$ such that $C$ is contained in one of the half-spaces associated with $H(\beta)$. Finally, in Section 3.4 it will become clear why the notations for the indicator and support functions are similar.
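For a finite set $C$ the supremum in the definition of the support function is a maximum, so the half-space picture above is easy to check numerically. The point set and direction below are illustrative choices:

```python
import numpy as np

# A finite point set C in R^2 and a direction a (both hypothetical examples).
C = np.array([[1.0, 0.0], [2.0, 1.0], [2.0, 0.0], [0.0, 0.5]])

def support(a, C):
    """Support function of a finite set: max over y in C of <y, a>."""
    return float(np.max(C @ a))

a = np.array([1.0, 1.0])
beta_bar = support(a, C)
print(beta_bar)  # 3.0, attained at the point (2, 1)

# Every point of C lies in the half-space {x : <a, x> <= beta_bar}.
print(bool(np.all(C @ a <= beta_bar + 1e-12)))  # True
```

The hyperplane $\{x : \langle a, x\rangle = \bar{\beta}\}$ touches $C$ at the maximizing point, matching the picture of a supporting hyperplane with $C$ on one side.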

Given a set $S \subseteq E$, we are sometimes interested in the smallest set with some property that contains $S$. For example, the smallest affine set that contains $S$ tells us, in some sense, whether we could fit the points of $S$ in a space of smaller dimension. For any $S \subseteq E$, define

• $\mathrm{aff} S := \bigcap \{\, M \subseteq E : S \subseteq M \text{ and } M \text{ is affine} \,\}$, called the affine hull of $S$;

• $\mathrm{conv} S := \bigcap \{\, C \subseteq E : S \subseteq C \text{ and } C \text{ is convex} \,\}$, called the convex hull of $S$.

Even though the above operations will not be used very often in the remainder of the text, they are important in some results and definitions that we see in this chapter. However, one may find it hard to build much intuition about the hull operations from the definitions given above. The following result shows an easier way to see the hull operations: the affine (convex) hull of a set $S \subseteq E$ is the set of all affine (convex) combinations of finitely many points of $S$.

Proposition 3.1.2 (see [59, Chapter 1] and [59, Theorem 2.3]). For any $S \subseteq E$, we have
\[
\mathrm{aff} S = \Big\{\, \sum_{i=1}^{m} \lambda_i x_i \in E : m \in \mathbb{N},\ \{x_i\}_{i=1}^{m} \subseteq S,\ \lambda \in \mathbb{R}^m \text{ s.t. } \sum_{i=1}^{m} \lambda_i = 1 \,\Big\}
\]
and
\[
\mathrm{conv} S = \Big\{\, \sum_{i=1}^{m} \lambda_i x_i \in E : m \in \mathbb{N},\ \{x_i\}_{i=1}^{m} \subseteq S,\ \lambda \in \mathbb{R}^m_{+} \text{ s.t. } \sum_{i=1}^{m} \lambda_i = 1 \,\Big\}.
\]

For concreteness, let us look at a small example. Define $S := \{(1,0), (2,1)\} \subseteq \mathbb{R}^2$. With the above proposition, one can easily see that $\mathrm{conv} S$ is the line segment between the points $(1,0)$ and $(2,1)$, and that $\mathrm{aff} S$ is the line that passes through $(1,0)$ and $(2,1)$. By setting $S' := S \cup \{(2,0)\}$, we have that $\mathrm{conv} S'$ is the region enclosed by the triangle formed by the points in $S'$ and, interestingly, $\mathrm{aff} S'$ is the entire space $\mathbb{R}^2$.
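The triangle example can be checked numerically. Since the three points of $S'$ are affinely independent, every $p \in \mathbb{R}^2$ is a unique affine combination of them, and by the proposition $p \in \mathrm{conv} S'$ exactly when the coefficients are also nonnegative. A sketch in Python (the test points are illustrative):

```python
import numpy as np

V = np.array([[1.0, 0.0], [2.0, 1.0], [2.0, 0.0]])  # the points of S'

def barycentric(p, V):
    """Solve sum_i lam_i * v_i = p together with sum_i lam_i = 1."""
    A = np.vstack([V.T, np.ones(3)])  # 3x3 system: two coordinates + sum
    b = np.append(p, 1.0)
    return np.linalg.solve(A, b)

inside = barycentric(np.array([1.7, 0.3]), V)   # lam = (0.3, 0.3, 0.4)
outside = barycentric(np.array([0.0, 0.0]), V)  # has a negative coefficient

print(bool(np.all(inside >= -1e-12)))   # True: point lies in conv S'
print(bool(np.all(outside >= -1e-12)))  # False: affine but not convex comb.
```

The origin illustrates the gap between the two hulls: it is an affine combination of the points of $S'$ (so it lies in $\mathrm{aff} S' = \mathbb{R}^2$) but not a convex one, so it lies outside the triangle.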