6.2 The max-min approach

6.2.2 Analysis

The goal of MIN is to find the best essential infimum of the arms, where the essential infimum is defined as follows.

Definition 6.2.1. Let $\nu$ be a probability distribution and $X \sim \nu$ a real random variable. The essential infimum $a_\nu$ of $\nu$ is defined by
$$a_\nu \stackrel{\text{def}}{=} \max \, \{ a \in \mathbb{R} \,:\, P(X < a) = 0 \}$$
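For instance (an illustrative case, not from the text), if $\nu$ is the uniform distribution on $[0.2, 1]$, then $P(X < a) = 0$ for every $a \le 0.2$ while $P(X < a) > 0$ for every $a > 0.2$, so that $a_\nu = 0.2$.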

Let us make the mild assumption that the distribution $\nu$ satisfies Equation 6.2, as illustrated in Fig. 6.1. Then the empirical min taken over a uniform sampling according to $\nu$ converges exponentially fast toward the essential infimum.

Lemma 6.2.1. Let $\nu$ be a bounded distribution with support in $[0, 1]$, with $a$ its essential infimum, and assume that $\nu$ satisfies:
$$\exists A > 0, \; \forall \varepsilon > 0, \quad P(X \le a + \varepsilon) \ge A\varepsilon, \quad \text{with } X \sim \nu \tag{6.2}$$
Let $x_1 \ldots x_t$ be a $t$-sample independently drawn from $\nu$. Then the minimum value over $x_u$, $u = 1 \ldots t$, goes exponentially fast to $a$:
$$P\Big(\min_{1 \le u \le t} x_u \ge a + \varepsilon\Big) \le \exp(-tA\varepsilon) \tag{6.3}$$

Proof. As the $x_u$ are iid, it follows that:
$$P\Big(\min_{1 \le u \le t} x_u \ge a + \varepsilon\Big) = P\big(\forall u \in \{1, \ldots, t\}, \; x_u \ge a + \varepsilon\big) = \prod_{u=1}^{t} P(x_u \ge a + \varepsilon) \le (1 - A\varepsilon)^t \le \exp(-tA\varepsilon)$$
where the first inequality follows from Equation 6.2 and the last from $(1 - z) \le \exp(-z)$.
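As a sanity check, the rate in Equation 6.3 can be observed numerically. The sketch below uses illustrative parameters that are not from the text: $\nu$ is taken uniform on $[a, 1]$, for which $P(X \le a + \varepsilon) = \varepsilon/(1-a)$, so Equation 6.2 holds with $A = 1/(1-a)$.

import numpy as np

# Monte Carlo check of Lemma 6.2.1 (illustrative parameters).
# nu = uniform on [a, 1]: P(X <= a + eps) = eps / (1 - a),
# so Equation 6.2 holds with A = 1 / (1 - a).
rng = np.random.default_rng(0)
a = 0.2
A = 1.0 / (1.0 - a)
eps = 0.05
n_runs = 20_000

for t in (10, 50, 100, 200):
    x = rng.uniform(a, 1.0, size=(n_runs, t))
    p_hat = (x.min(axis=1) >= a + eps).mean()  # empirical P(min >= a + eps)
    bound = np.exp(-t * A * eps)               # right-hand side of Equation 6.3
    print(f"t={t:4d}  P(min >= a+eps) ~= {p_hat:.4f}  <=  bound {bound:.4f}")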

[Figure 6.1: Illustration of an example of a distribution satisfying the assumption of Equation 6.2; the support is contained in $[0, 1]$.]

The assumption supporting the above result (Equation 6.2, illustrated in Figure 6.1) does not require the positive constant $A$ to be known; it only requires that there is enough probability mass in the neighborhood of $a$. Equation 6.3 confirms the exponential convergence toward $a$ as a function of $A$.

A surprising result is that, under this assumption, the convergence toward the minimum might be faster than the convergence toward the mean. Specifically, the Hoeffding bound on the convergence toward the mean decreases exponentially like $\exp(-2t\varepsilon^2)$, whereas by Equation 6.3 the convergence toward the min decreases exponentially like $\exp(-tA\varepsilon)$ (as $\varepsilon$ goes to 0, $A\varepsilon \gg \varepsilon^2$).
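To fix orders of magnitude (illustrative values, not from the text): with $t = 100$, $\varepsilon = 0.05$ and $A = 1$, the Hoeffding bound gives $\exp(-2t\varepsilon^2) = \exp(-0.5) \approx 0.61$, whereas Equation 6.3 gives $\exp(-tA\varepsilon) = \exp(-5) \approx 0.007$.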

Under this assumption, it follows without difficulty that, with high probability, the empirical min of each arm is exponentially close to its essential infimum after each arm has been tried $t$ times.

Lemma 6.2.2. Let $\nu_1 \ldots \nu_K$ denote $K$ distributions with bounded support in $[0, 1]$, with $a_i$ their essential infima. Assume that $\nu_i$ satisfies Equation 6.2 for some constant $A$, for $i = 1 \ldots K$. Denoting $x_{i,u}$, $u = 1 \ldots t$, $i = 1 \ldots K$, $t$ samples independently drawn from $\nu_i$, one has:
$$P\Big(\exists i \in \{1, \ldots, K\}, \; \min_{u \le t} x_{i,u} \ge a_i + \varepsilon\Big) \le K \exp(-tA\varepsilon) \tag{6.4}$$

Proof. By Lemma 6.2.1,
$$P\Big(\exists i \in \{1, \ldots, K\}, \; \min_{u \le t} x_{i,u} \ge a_i + \varepsilon\Big) \le 1 - \big(1 - (1 - A\varepsilon)^t\big)^K \le K (1 - A\varepsilon)^t \le K \exp(-tA\varepsilon)$$
where the second inequality follows from $(1 - z)^y \ge 1 - yz$ and the last from $(1 - z) \le \exp(-z)$, which concludes the proof.

Let us consider the two distinct goals of finding the arm with best expectation, and the arm with best essential infimum. If these goals are compatible (that is, the optimal arm in terms of min value also is the optimal arm in terms of mean value), then the MIN algorithm achieves a logarithmic regret under the above assumptions.
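The selection rule of MIN is not restated here; the proof below relies on the fact that a sub-optimal pull implies that the pulled arm has a higher empirical min than the optimal arm, which is consistent with a greedy rule pulling the arm with the highest empirical minimum. The following Python sketch implements that reading; the arm distributions, horizon, and all names are illustrative assumptions, not the authors' implementation.

import numpy as np

def min_policy(arms, horizon, rng):
    """Greedy max-min sketch: after one initial pull per arm, always pull
    the arm whose empirical minimum reward is currently the highest."""
    K = len(arms)
    emp_min = np.array([arm(rng) for arm in arms])  # one initial sample per arm
    pulls = np.ones(K, dtype=int)
    for _ in range(horizon - K):
        i = int(np.argmax(emp_min))       # arm with the best empirical min
        r = arms[i](rng)
        emp_min[i] = min(emp_min[i], r)   # empirical mins only decrease
        pulls[i] += 1
    return pulls

# Two illustrative arms on [0, 1]: arm 0 uniform on [0.4, 1] (a_0 = 0.4, mu_0 = 0.7),
# arm 1 uniform on [0, 1] (a_1 = 0, mu_1 = 0.5); arm 0 is optimal for both criteria.
arms = [lambda g: g.uniform(0.4, 1.0), lambda g: g.uniform(0.0, 1.0)]
print(min_policy(arms, horizon=10_000, rng=np.random.default_rng(1)))

On such an instance the sub-optimal arm is quickly abandoned: its empirical min falls below $a_0 = 0.4$ after a few pulls with high probability, consistent with the bound on sub-optimal pulls derived below.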

Proposition 6.2.1. Let $\nu_1 \ldots \nu_K$ denote $K$ distributions with bounded support in $[0, 1]$, with $\mu_i$ (resp. $a_i$) their mean (resp. their essential infimum). Further assume that $\nu_i$ satisfies Equation 6.2 for some constant $A$ for $i = 1 \ldots K$, and that the arm with best mean value $\mu^\star$ also is the arm with best min value $a^\star$. Let $\Delta_{\mu,i} = \mu^\star - \mu_i$ (resp. $\Delta_{a,i} = a^\star - a_i$) denote the mean-related (resp. essential infimum-related) margins.

Then, with probability at least $1 - \delta$, the cumulative pseudo-regret is upper bounded as follows:
$$R_t \le \frac{K-1}{A} \, \frac{\Delta_{\mu,\max}}{\Delta_{a,\min}} \, \log\Big(\frac{tK}{\delta}\Big) + (K-1)\Delta_{\mu,\max} \tag{6.5}$$
with $\Delta_{a,\min} = \min_{i : \Delta_{a,i} > 0} \Delta_{a,i}$ and $\Delta_{\mu,\max} = \max_{i : \Delta_{\mu,i} > 0} \Delta_{\mu,i}$.

Furthermore, the expectation of the cumulative pseudo-regret is upper-bounded as follows for $t$ sufficiently large ($t \ge \frac{K}{A}\frac{\Delta_{\mu,\max}}{\Delta_{a,\min}}$):
$$E[R_t] \le \frac{K-1}{A} \, \frac{\Delta_{\mu,\max}}{\Delta_{a,\min}} \left( \log\Big(\frac{t^2 K A}{K-1} \, \frac{\Delta_{a,\min}}{\Delta_{\mu,\max}}\Big) + 1 \right) + (K-1)\Delta_{\mu,\max} \tag{6.6}$$
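To make Equation 6.5 concrete (illustrative values, not from the text): with $K = 2$, $A = 1$, $\Delta_{\mu,\max} = 0.1$, $\Delta_{a,\min} = 0.2$, $\delta = 0.05$ and $t = 1000$, the bound reads $R_t \le \frac{1}{1} \cdot \frac{0.1}{0.2} \log\frac{2000}{0.05} + 0.1 \approx 0.5 \times 10.6 + 0.1 \approx 5.4$.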

Proof. Suppose that there exists a single optimal arm (this point will be discussed below). Taking inspiration from (Sani et al., 2012a), let $x_{i,u}$ be independent samples drawn from $\nu_i$, and define the event $E$ as follows:
$$E = \Big\{ \forall i \in \{1, \ldots, K\}, \; \forall u \in \{1, \ldots, t\}, \quad \min_{1 \le s \le u} x_{i,s} - a_i \le \frac{\varepsilon}{u} \Big\} \tag{6.7}$$
The probability of the complementary event $E^c$ is bounded after Lemma 6.2.2:
$$P(E^c) = P\Big(\exists i \in \{1, \ldots, K\}, \; \exists u \in \{1, \ldots, t\}, \; \min_{1 \le s \le u} x_{i,s} - a_i > \frac{\varepsilon}{u}\Big) \le \sum_{u=1}^{t} P\Big(\exists i \in \{1, \ldots, K\}, \; \min_{1 \le s \le u} x_{i,s} - a_i > \frac{\varepsilon}{u}\Big) \le \min\big(1, \, tK\exp(-A\varepsilon)\big)$$
where each term of the sum is bounded via Lemma 6.2.2 applied with $u$ samples and tolerance $\varepsilon/u$, yielding $K\exp(-u \cdot A\varepsilon/u) = K\exp(-A\varepsilon)$.

Let $t > 1$ be an iteration where a sub-optimal arm $i$ is selected; this implies that the empirical min of the $i$-th arm is higher than that of the best arm $i^\star$:
$$\min_{u \le N_{i^\star,t-1}} x_{i^\star,u} < \min_{u \le N_{i,t-1}} x_{i,u} \;\Longleftrightarrow\; \underbrace{\min_{u \le N_{i^\star,t-1}} x_{i^\star,u} - a_i}_{\ge \, a_{i^\star} - a_i \, = \, \Delta_{a,i}} \; < \; \underbrace{\min_{u \le N_{i,t-1}} x_{i,u} - a_i}_{\le \, \varepsilon / N_{i,t-1} \;\; (\star)}$$
where the left underbrace always holds since the empirical min of the optimal arm is at least its essential infimum $a_{i^\star}$, and $(\star)$ holds if $t$ belongs to the event set $E$, thus with probability at least $1 - tK\exp(-A\varepsilon)$ after Lemma 6.2.2.

It follows that with probability at least $1 - tK\exp(-A\varepsilon)$:
$$\Delta_{a,i} < \frac{\varepsilon}{N_{i,t-1}}, \quad \text{hence} \quad N_{i,t} \le \frac{\varepsilon}{\Delta_{a,i}} + 1 \quad \text{since } N_{i,t} \le N_{i,t-1} + 1.$$

With probability at least $1 - tK\exp(-A\varepsilon)$, the cumulative regret $R_t$ can thus be upper-bounded ($N_{i,t}$ denoting the number of times arm $i$ has been selected up to time $t$):
$$R_t = \sum_{i=1}^{K} N_{i,t} \, \Delta_{\mu,i} \le \sum_{i=1}^{K} \Big(\frac{\varepsilon}{\Delta_{a,i}} + 1\Big) \Delta_{\mu,i} \le (K-1)\Big(\frac{\Delta_{\mu,\max}}{\Delta_{a,\min}} \, \varepsilon + \Delta_{\mu,\max}\Big) \tag{6.8}$$
with $\Delta_{\mu,\max} = \max_{i : \Delta_{\mu,i} > 0} \Delta_{\mu,i}$ and $\Delta_{a,\min} = \min_{i : \Delta_{a,i} > 0} \Delta_{a,i}$, where the sums only involve the $K-1$ sub-optimal arms since $\Delta_{\mu,i^\star} = 0$.

Finally, by setting $\delta = \min(1, \, tK\exp(-A\varepsilon))$, i.e. $\varepsilon = \frac{1}{A}\log(tK/\delta)$, it follows that with probability $1 - \delta$,
$$R_t \le \frac{K-1}{A} \, \frac{\Delta_{\mu,\max}}{\Delta_{a,\min}} \, \log\Big(\frac{tK}{\delta}\Big) + (K-1)\Delta_{\mu,\max} \tag{6.9}$$
In the case where there exist $k > 1$ optimal arms, Eq. 6.9 still holds, by replacing the $K-1$ factor with $K-k$.

The expectation of the cumulative regret is similarly upper-bounded:
$$E[R_t] = E[R_t \mathbb{1}_E] + E[R_t \mathbb{1}_{E^c}] \le \frac{K-1}{A} \, \frac{\Delta_{\mu,\max}}{\Delta_{a,\min}} \, \log\Big(\frac{tK}{\delta}\Big) + (K-1)\Delta_{\mu,\max} + \delta t$$
by bounding $R_t$ by $t$ over $E^c$.

For $t$ sufficiently large ($t \ge \frac{K}{A}\frac{\Delta_{\mu,\max}}{\Delta_{a,\min}}$), by setting $\delta = \frac{K-1}{tA}\frac{\Delta_{\mu,\max}}{\Delta_{a,\min}}$, one obtains:
$$E[R_t] \le \frac{K-1}{A} \, \frac{\Delta_{\mu,\max}}{\Delta_{a,\min}} \left( \log\Big(\frac{t^2 K A}{K-1} \, \frac{\Delta_{a,\min}}{\Delta_{\mu,\max}}\Big) + 1 \right) + (K-1)\Delta_{\mu,\max} \tag{6.10}$$
which concludes the proof.

Remark 6.2.1. This result can be compared to the regret bound derived for the UCB algorithm, which similarly achieves a logarithmic regret (Auer et al., 2002):
$$E[R_t] \le 8 \sum_{i \ne i^\star} \frac{\log t}{\Delta_{\mu,i}} + \Big(1 + \frac{\pi^2}{3}\Big) \sum_{i=1}^{K} \Delta_{\mu,i} \tag{6.11}$$
where $i^\star$ stands for the index of the optimal arm. MIN and UCB thus both achieve a logarithmic regret uniformly over $t$, where the regret rate involves the mean-related margin in UCB (resp. the min-related margin in MIN, multiplied by the constant $A$).
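On the same illustrative instance as above ($K = 2$, a single sub-optimal arm with $\Delta_{\mu,i} = 0.1$, $\Delta_{a,\min} = 0.2$, $A = 1$), the leading term of Equation 6.11 is $\frac{8}{0.1}\log t = 80\log t$, while that of Equation 6.9 is $\frac{1}{1}\cdot\frac{0.1}{0.2}\log t = 0.5\log t$: a small mean-related margin inflates the UCB rate (it appears in the denominator) but deflates the MIN rate (it appears in the numerator).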

A stronger result can be obtained for MIN, under an additional assumption on the lower tails of the arm distributions.

Proposition 6.2.2. With the same notations and assumptions as in Prop. 6.2.1, let us further assume that for every $i = 1 \ldots K$, $\Delta_{\mu,i} = \mu^\star - \mu_i \le a^\star - a_i = \Delta_{a,i}$.

Then, with probability at least $1 - \delta$,
$$R_t \le \frac{K-1}{A} \log\Big(\frac{tK}{\delta}\Big) + (K-1)\Delta_{\mu,\max}$$
with $\Delta_{\mu,\max} = \max_i \Delta_{\mu,i}$.

Furthermore, if $t > \frac{K}{A}$, the expectation of $R_t$ is upper-bounded as follows:
$$E[R_t] \le \frac{K-1}{A} \left( \log\Big(\frac{t^2 K A}{K-1}\Big) + 1 \right) + (K-1)\Delta_{\mu,\max} \tag{6.12}$$

Proof. The proof closely follows that of Prop. 6.2.1, noting that in Eq. 6.8, $\Delta_{a,i}$ is now greater than $\Delta_{\mu,i}$. Setting $\delta = \frac{K-1}{tA}$ concludes the proof of Eq. 6.12.

Discussion. The comparison of UCB and MIN only makes sense when the two goals coincide, that is, when the same arm is optimal in terms of expectation and in terms of essential infimum. When this is the case, Eq. 6.12 and Eq. 6.11 suggest that MIN might outperform UCB when: i) the margins $\Delta_{\mu,i}$ are small; ii) the distributions $\nu_i$ are not too thin in the neighborhood of the essential infimum (that is, $A$ is not too small); and iii) the assumption $\Delta_{a,i} \ge \Delta_{\mu,i}$ holds.

Note that the latter assumption boils down to considering that better arms (in the sense of their mean) also have a narrower support for their lower tail, and thus a lower risk. If this assumption does not hold, however, then risk minimization and regret minimization are likely to be conflicting objectives.

A last remark is that the assumptions made (a lower-bounded distribution density in the neighborhood of the essential infimum, and a minimum-related margin greater than the mean-related margin) yield a significant improvement compared to the continuous distribution-free case, where the optimal regret is known to be $O(\sqrt{t})$ (Audibert and Bubeck, 2009, 2010).