
2.6 From Online Learning to Online Convex Optimization

2.6.2 Randomization

Recall that Proposition 2.3.3 shows that for simple instances of prediction with expert advice it is impossible to find a player oracle which attains regret sublinear in the number of rounds. That proposition relies on the enemy being able to predict exactly the next player prediction in the online learning setting. What if, instead of making predictions deterministically, the player decided only on a probability distribution over her possible predictions and left the actual choice for randomness to take care of? Let us take the online binary classification case as an example. Given a query, the player may think that there is a 60% chance of it being from class 1, for example. If the player deterministically picks the class in which she is more confident, the enemy will be able to exploit that. If, instead, we flip a biased coin which the enemy does not have access to, we take away part of this advantage. Randomizing the choices seems even more appealing when there is a greater number of choices and the confidence the player has about each choice is small. In view of this discussion, we can change the model to allow the player oracle in the online learning setting to randomize its predictions, and restrict the information available to the enemy oracle: it will not have access to the “random bits” played. For example, consider a player oracle in the prediction with expert advice problem that, instead of choosing an expert deterministically, samples one from some probability distribution. The key here is that the enemy is able to simulate this player oracle and see the probability of each expert being sampled that round, but the enemy does not know which expert gets sampled before making a decision.
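As a minimal illustration of the biased-coin mechanism above (a sketch in Python, assuming 0/1 labels and 0-1 loss; the helper randomized_predict is purely illustrative and not part of the formal framework), the following shows a player who commits only to a distribution over the two classes. The enemy can compute this distribution by simulating the player, but must commit to the true label before the coin is flipped.

import random

def randomized_predict(p_class1, rng):
    # Sample a prediction from {0, 1}: class 1 with probability p_class1.
    # The enemy may know p_class1 (it can simulate the player), but it does
    # not see the outcome of this draw before committing to the true label.
    return 1 if rng.random() < p_class1 else 0

rng = random.Random(0)   # the player's private "random bits"
p_class1 = 0.6           # the player's confidence that the query is from class 1
prediction = randomized_predict(p_class1, rng)
# Against any fixed label y, the expected 0-1 loss is
#   p_class1 if y == 0 else (1 - p_class1),
# i.e. at most max(p_class1, 1 - p_class1) = 0.6 here, instead of the loss 1
# that a simulating enemy could force on a deterministic player.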

Formally, let P := (X, Y, D, L) be an online learning instance such that Y and D are each equipped with a σ-algebra²¹ and such that L is measurable. Additionally, let (Ω, Σ, P) be a probability space.

²¹ If D is a finite set or R, there are natural σ-algebras over them: namely, the power set of D and the (Borel) σ-algebra generated by the open intervals in the real line, respectively. Whenever we are in one of these cases, we assume D is equipped with such a σ-algebra, unless stated otherwise.

A randomized player oracle for P is a function PLAYER : Seq(X) × Seq(Y) → D^Ω such that every function F : Ω → D in the image of PLAYER is measurable, that is, F is a random variable over (Ω, Σ, P) that takes values in D. In Algorithm 2.4 we overload the definition of OLP for randomized player oracles, making it clear which information each oracle has access to. Moreover, in the definition of OLP for randomized player oracles we also naturally suppose that ENEMY is a measurable enemy oracle for P, that is, we assume that for each T ∈ N \ {0} and every x ∈ X^T the function ENEMY(x, ·) from D^{T−1} to Y is measurable. Finally, the definition of regret for randomized player oracles is similar to the definition of regret for deterministic player oracles seen earlier. Note, however, that the function Regret for online learning becomes a random variable in the case of randomized player oracles (given that the enemy oracle is a measurable enemy oracle and that the loss function is measurable).

Algorithm 2.4 Definition of [OLP(NATURE, PLAYER, ENEMY, T)](ω) (overloading OLP)

Input:

(i) an OL instance P = (X, D, Y, L) such that D and Y are each equipped with a σ-algebra and L is Borel measurable,

(ii) nature and measurable enemy oracles for P denoted, respectively, by NATURE and ENEMY,

(iii) a randomized player oracle PLAYER on a probability space (Ω, Σ, P),

(iv) a number T ∈ N of rounds, and

(v) an element ω ∈ Ω (the “random bits”).

Output: (x, d, y) ∈ X^T × D^T × Y^T.

for t = 1 to T do
    x_t ← NATURE(t)
    D_t ← PLAYER(⟨x_1, . . . , x_t⟩, ⟨Y_1(ω), . . . , Y_{t−1}(ω)⟩)
    Y_t(ω) ← ENEMY(⟨x_1, . . . , x_t⟩, ⟨D_1(ω), . . . , D_{t−1}(ω)⟩)
return (x, ⟨D_1(ω), . . . , D_T(ω)⟩, ⟨Y_1(ω), . . . , Y_T(ω)⟩)
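As a concrete reading of Algorithm 2.4, the sketch below (a non-authoritative Python rendering; the callables nature, player, and enemy and the argument rng are placeholders for the corresponding oracles and for the “random bits” ω) makes explicit which information each oracle receives: the player sees the realized past enemy choices, while the enemy sees the realized past player decisions but neither the current decision nor the random bits themselves.

def olp(nature, player, enemy, T, rng):
    # Minimal sketch of Algorithm 2.4 for randomized player oracles.
    #   nature(t)           -> query x_t
    #   player(xs, ys, rng) -> decision D_t(omega), computed using rng (the "random bits")
    #   enemy(xs, ds)       -> point Y_t(omega), sees realized past decisions only
    xs, ds, ys = [], [], []
    for t in range(1, T + 1):
        xs.append(nature(t))
        ds.append(player(list(xs), list(ys), rng))   # uses y_1, ..., y_{t-1}
        ys.append(enemy(list(xs), list(ds[:-1])))    # uses D_1, ..., D_{t-1}, not D_t
    return xs, ds, ys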

Let us revisit the prediction with expert advice problem. In Proposition 2.5.5, we looked at expert instances in which the loss function was convex w.r.t. its first argument and the advice set was convex. In that case, it was enough to have a player oracle for the OCO instance C := (∆_E, F), where E is the set of experts and F is a set of linear functions, in order to build a player oracle for the original experts problem with the same regret guarantees. The next proposition shows how to build a randomized player oracle for the experts problem from a player oracle for the OCO instance C such that the expected regret on the experts problem equals the regret of the player for C against a properly chosen enemy.

Proposition 2.6.2. Let P := (A^E, A, Y, L) be a prediction with expert advice problem such that A and Y are each equipped with a σ-algebra and L is measurable. Moreover, let C := (∆_E, F) be an OCO instance where

    F := { p ∈ R^E ↦ p^T c : c ∈ [−1, 1]^E }.

Finally, let PLAYER_OCO be a player oracle for C and let U := { e_i ∈ {0, 1}^E : i ∈ E }. Then there exists a randomized player oracle PLAYER_OL for P such that, for any T ∈ N and any sequences x ∈ (A^E)^T and y ∈ Y^T, there is f ∈ F^T such that E[Regret(x, PLAYER_OL, y, H)] = Regret(PLAYER_OCO, f, U), where f_t(p) := Σ_{e∈E} L(x_t(e), y_t) p(e) for every p ∈ R^E and t ∈ [T].

Proof. For every x ∈ A^E and y ∈ Y, define c(x, y) ∈ [−1, 1]^E by (c(x, y))_e := L(x(e), y), ∀ e ∈ E.

Let (Ω, Σ, P) be a probability space such that, for each T ∈ N \ {0} and p ∈ ∆_E, there is an independent random variable²² I_p : Ω → E such that P(I_p = e) = p_e for each e ∈ E. Finally, define the randomized player oracle PLAYER_OL for P given, for every T ∈ N, x ∈ X^T, and y ∈ Y^{T−1}, by

    [PLAYER_OL(x, y)](ω) := x_T(I_{p_T}(ω)) for each ω ∈ Ω, where

    f_t(z) := c(x_t, y_t)^T z for all z ∈ R^E and t ∈ {1, . . . , T − 1}, and p_T := PLAYER_OCO(f).

Let T ∈ N, x ∈ X^T, and y ∈ Y^T. Moreover, for each t ∈ [T] define

    f_t(z) := c(x_t, y_t)^T z, ∀ z ∈ R^E,
    p_t := PLAYER_OCO(f_{1:t−1}), and
    D_t := PLAYER_OL(x_{1:t}, y_{1:t−1}).

Let t ∈ [T]. Note that D_t = x_t(I_{p_t}(·)). Thus, P(D_t = x_t(e)) = P(I_{p_t} = e) = p_t(e) for each e ∈ E.

Since y is fixed, we have

    E[L(D_t, y_t)] = Σ_{e∈E} p_t(e) L(x_t(e), y_t) = Σ_{e∈E} p_t(e) [c(x_t, y_t)](e) = p_t^T c(x_t, y_t) = f_t(p_t).

With that, by setting R_OL := Regret(x, PLAYER_OL, y, H) we have

    E[R_OL] = E[ Σ_{t=1}^T L(D_t, y_t) − min_{i∈E} Σ_{t=1}^T L(x_t(i), y_t) ]
            = Σ_{t=1}^T E[L(D_t, y_t)] − min_{i∈E} Σ_{t=1}^T L(x_t(i), y_t)
            = Σ_{t=1}^T f_t(p_t) − min_{i∈E} Σ_{t=1}^T f_t(e_i)
            = Regret(PLAYER_OCO, f, U).
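To make the construction in the proof concrete, the following sketch assumes losses in [−1, 1], represents advice vectors as dictionaries indexed by expert names, and uses a multiplicative-weights-style rule as a stand-in for PLAYER_OCO; none of these specific choices are prescribed by the proposition. It samples an expert from the distribution p_T produced by the OCO player on the linear losses f_t(z) = c(x_t, y_t)^T z and then follows that expert's advice.

import math
import random

def player_oco(cs, experts, eta=0.5):
    # A stand-in OCO player over the simplex Delta_E for linear losses:
    # a multiplicative-weights-style rule, used here only for illustration.
    # cs is the list of cost vectors c(x_1, y_1), ..., c(x_{t-1}, y_{t-1}).
    w = {e: math.exp(-eta * sum(c[e] for c in cs)) for e in experts}
    total = sum(w.values())
    return {e: w[e] / total for e in experts}

def player_ol(xs, ys, loss, experts, rng):
    # Randomized player oracle in the spirit of Proposition 2.6.2.
    #   xs = (x_1, ..., x_T): advice vectors seen so far (dicts expert -> advice)
    #   ys = (y_1, ..., y_{T-1}): the enemy's past choices
    # Cost vectors (c(x_t, y_t))_e = L(x_t(e), y_t) for the rounds already played.
    cs = [{e: loss(x[e], y) for e in experts} for x, y in zip(xs[:-1], ys)]
    p = player_oco(cs, experts)          # distribution p_T over the experts
    u = rng.random()                     # private randomness playing the role of I_{p_T}
    acc = 0.0
    for e in experts:                    # sample an expert e with probability p[e]
        acc += p[e]
        if u <= acc:
            return xs[-1][e]             # follow expert e's advice x_T(e)
    return xs[-1][experts[-1]]           # guard against floating-point rounding

Conditioned on the past rounds, the expected loss of the sampled decision is Σ_{e∈E} p_t(e) L(x_t(e), y_t) = f_t(p_t), which is exactly the identity used in the proof to match the expected experts regret with the OCO regret.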

We have claimed that the above proposition shows that a player oracle for an OCO instance with low regret guarantees is enough to build a randomized player oracle for the experts problem with good guarantees on the expected regret. However, as one might have noticed, there is a catch.

The above proposition proves an upper bound on the expected regret of the randomized player oracle against fixed sequences of enemy choices. In other words, it only proves that such a player oracle has low expected regret against enemies which are oblivious to the choices of the player. This nuance is often overlooked, but we find it extremely insightful to understand it in depth. Thus, let us discuss this issue more formally.

Let P := (X, Y, D, L) be an online learning instance such that Y and D are each equipped with a σ-algebra and L is measurable, and let NATURE, PLAYER, and ENEMY be nature, player, and measurable enemy oracles for P, respectively. Let H ⊆ D^X, let T ∈ N, and suppose there is α ∈ R such that

    Regret(x, PLAYER, y, H) ≤ α,   ∀ x ∈ X^T, ∀ y ∈ Y^T.   (2.12)

The above bound, which is of the same type as the bound in Proposition 2.5.5, guarantees an upper bound of α on the regret of PLAYER against NATURE and ENEMY. To see this, note that if we set

    (x, d, y) := OLP(NATURE, PLAYER, ENEMY, T),

we can plug such sequences x and y into (2.12) to obtain a bound on the regret of the player oracle in this game. Let us now look at what happens if we have a bound like the one in Proposition 2.6.2.

²² Since E is finite, it is a measurable space when equipped with its power set as its σ-algebra.

Namely, let PLAYER be a randomized player oracle for P on a probability space (Ω, Σ, P), and suppose there is β ∈ R such that

    E[Regret(x, PLAYER, y, H)] ≤ β,   ∀ x ∈ X^T, ∀ y ∈ Y^T.   (2.13)

Even though (2.12) and (2.13) are similar, the latter does not directly yield a bound on the expected regret of PLAYER against NATURE and ENEMY. To see this, define

    (x, D, Y) := OLP(NATURE, PLAYER, ENEMY, T)

and let t ∈ {2, . . . , T}. Note that Y_t is a function of x_{1:t} and D_{1:t−1}, and since the latter are random variables (since t > 1), we have that Y_t is a random variable. That is, we cannot plug Y in the place of y in (2.13). Thus, (2.13) only applies directly in the cases where ENEMY is oblivious (to the choices of the player), that is, when

    ENEMY(x, d) = ENEMY(x, d′),   ∀ T ∈ N \ {0}, ∀ x ∈ X^T, ∀ d, d′ ∈ D^{T−1}.

In words, an oblivious enemy oracle depends only on the nature queries, without taking the choices of the player into account in his decisions. An extreme case of obliviousness is when the enemy oracle depends only on the length of the sequence of queries from nature (i.e. the number of the current round). In this latter case, the oblivious enemy oracle already knows which points it is going to pick at each round even before the game begins. Oblivious enemies are no longer functions of the specific choices of the player, only of the nature queries, which are deterministic. Thus, the bound in (2.13) can be applied to games where the randomized player oracle plays against oblivious enemies.
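To illustrate the distinction, the short sketch below (continuing the Python convention used above, with hypothetical 0/1-valued enemy choices) shows an oblivious enemy oracle, which satisfies the displayed condition because it ignores the player's past decisions, next to an adaptive one, which does not; bound (2.13) applies directly only to the former.

def oblivious_enemy(xs, ds):
    # Depends only on the round number len(xs); the argument ds (the player's
    # realized past decisions) is ignored, so ENEMY(x, d) = ENEMY(x, d') for all d, d'.
    return len(xs) % 2

def adaptive_enemy(xs, ds):
    # Reacts to the player's realized past decisions, e.g. by contradicting the
    # player's previous choice; bound (2.13) does not apply directly to such enemies.
    return 1 - ds[-1] if ds else 0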

At first sight, devising randomized player oracles with low expected regret against oblivious enemies seems much easier than devising such oracles to work against general/adaptive enemies.

Surprisingly, it seems that in most online learning problems, a player oracle which is guaranteed to attain low expected regret against any oblivious enemy is also guaranteed to attain low expected regret against adaptive enemies. For example, [24, Lemma 4.1] states that, in the prediction with expert advice problem, if a randomized player oracle which only chooses actions suggested by one of the experts (which is the usual case) has expected regret against any oblivious enemy bounded by β ∈ R, then its expected regret against adaptive enemies is also bounded by β. However, it is not clear how to adapt the arguments from the proof of [24, Lemma 4.1] to the framework we present here since, for example, the lemma by Cesa-Bianchi and Lugosi focuses on the experts problem.

Still, it seems that the claim that a player oracle which performs well against any oblivious enemy also performs well against any adaptive enemy should hold for more general online learning problems. For example, if the set Y from which the enemy makes his choices is countable, then using expectations conditioned on the choices of the enemy and the law of total expectation seems to prove this claim. Additionally, in [28, Section 3], the authors prove that randomized player oracles for the online optimization setting whose random variables picked by the player are all independent²³ and that have low expected regret against oblivious enemies also have low expected regret against adaptive enemies. Since the nature oracle for any OL instance is deterministic, it seems that this result should extend to the online learning setting almost seamlessly. However, some arguments used in the proof of [28, Theorem 3.1] do not translate seamlessly to the framework presented in this text (in particular, the application of their induction hypothesis).

Finally, this “equivalence” between the power of players against adaptive and oblivious enemies does not seem to hold for an online learning framework with bandit feedback [10], that is, one where the player oracle only receives the loss it suffered on past rounds, not the points picked by the enemy as in the classical online learning setting. Thus, having a clear proof of the claim discussed above for our framework, and looking at which arguments fail in the bandit setting, may be very insightful.

²³ Which we suppose for the randomized player oracles of this section.