7.3 Contextual regret bound
7.3.2 Theoretical bound
Then, the contextual regret of CL-BESA after $T$ rounds is upper bounded by
\[
\mathbb{E}[R_{X,T}] \le \Big(\max_{t\in[T]}\Delta_t\Big)\Bigg(\frac{64}{\min_{t\in[T]}\Delta_t^2}\bigg[R_\eta\sqrt{2d\log\Big(\frac{\lambda^{1/2}T^2+T^3(\|\mu\|_2+\sigma_X)^2}{d\lambda^{1/2}}\Big)}+\lambda^{1/2}B\bigg]^2+\frac{24\sigma_X^2\log(T)-2\lambda}{\|\mu\|_2^2}\Bigg)+\sum_{t=1}^{T}\Delta_t\,\mathbb{I}\Big\{\min_{t\in[T]}\Delta_t\le\tau\Big\}+O(1)\,,
\]
where the expectation is with respect to the internal randomness of the algorithm and of the additive reward noise, and where
\[
\tau\stackrel{\text{def}}{=}2\sigma_X\bigg[R_\eta\sqrt{\frac{2d}{\lambda}\log\Big(\frac{\lambda^{1/2}T^2+T^3(\|\mu\|_2+\sigma_X)^2}{d\lambda^{1/2}}\Big)}+B\bigg]\,. \quad (7.8)
\]
When the context perturbation is $\sigma_X=0$, then $\Delta_t$ reduces to $\Delta\stackrel{\text{def}}{=}|\langle\mu,\theta_a-\theta_b\rangle|$ and we obtain
\[
\mathbb{E}[R_{X,T}]\le\frac{128\,R_\eta^2\,d}{\Delta}\log\big(2T^3\|\mu\|_2^2/d\big)+O(1)\,.
\]
Algorithm                                        Regret bound
CL-BESA                                          $O(d\log(T^3))$
OFUL (Abbasi-Yadkori et al., 2011)               $O(d\log^2(T))$
Confidence Bound (Dani et al., 2008)             $O(d^2\log^3(T))$
Thompson sampling (Agrawal and Goyal, 2012a)     $O(d^2\sqrt{T})$

Table 7.1: Theoretical regret bounds for contextual bandit algorithms.
Remark 7.3.1. For the sake of comparison, Table 7.1 lists the theoretical regret bounds associated with state-of-the-art algorithms, showing the merits of CL-BESA: the associated regret scales linearly with the dimension $d$, and logarithmically with the time horizon $T$. This result establishes the applicability of the sub-sampling technique to the contextual multi-armed bandit problem.
Remark 7.3.2. The main restriction of the analysis concerns the minimum gap $\min_{t\in[T]}\Delta_t$ (see Eq. 7.8). Note that a similar limitation is encountered in the distribution-dependent analysis of OFUL; additionally, its authors explicitly assume a constant optimal arm in the distribution-dependent proof.
Remark 7.3.3. The assumption $\lambda\ge 6\sigma_X^2\log(T)$ is mainly formulated for technical reasons. Also, the experiments (see Section 7.4) suggest that Equations 7.7 and 7.8 might be improved further. In practice, $\lambda=\lambda_t=\Omega(\sigma_X^2)\log(t)$ is recommended, even though a good robustness with respect to the choice of $\lambda$ is observed in the experiments.
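As a concrete illustration of these tuning quantities, the sketch below computes the recommended schedule $\lambda_t=c\,\sigma_X^2\log(t)$ (with $c=6$ mirroring the assumption $\lambda\ge 6\sigma_X^2\log(T)$) and the gap threshold $\tau$ of Eq. (7.8); all numerical parameter values are illustrative assumptions, not values prescribed by the text.

```python
import math

# Hedged sketch (illustrative constants): the regularization schedule
# lambda_t = c * sigma_X**2 * log(t) from Remark 7.3.3, with c = 6 mirroring
# the assumption lambda >= 6*sigma_X^2*log(T), and the gap threshold tau of
# Eq. (7.8). All parameter values below are assumptions for illustration.
def lambda_schedule(t, sigma_x, c=6.0):
    return c * sigma_x ** 2 * math.log(max(t, 2))

def tau(T, d, lam, sigma_x, mu_norm, R_eta, B):
    C2 = (mu_norm + sigma_x) ** 2  # C^2 = (||mu||_2 + sigma_X)^2
    inner = (math.sqrt(lam) * T ** 2 + T ** 3 * C2) / (d * math.sqrt(lam))
    return 2 * sigma_x * (R_eta * math.sqrt((2 * d / lam) * math.log(inner)) + B)

T, d, sigma_x = 10_000, 5, 0.1
lam = lambda_schedule(T, sigma_x)
print(round(tau(T, d, lam, sigma_x, mu_norm=1.0, R_eta=0.5, B=1.0), 3))
```

The bound of the theorem is vacuous whenever $\min_{t\in[T]}\Delta_t$ falls below this $\tau$, which is why the condition on the context noise matters in practice.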
Proof. Step 1: Let $\star_t$ be the optimal action at time $t$ and $\lnot\star_t$ the other action (recall that there are only $K=2$ arms, $\mathcal{A}=\{a,b\}$).
By definition of the contextual regret at time $T$,
\[
R_{X,T}=\sum_{t=1}^{T}\langle X_t,\theta_{\star_t}-\theta_{I_t}\rangle
=\sum_{t=1}^{T}\langle X_t,\theta_{\star_t}-\theta_{\lnot\star_t}\rangle\,\mathbb{I}\{I_t=\lnot\star_t\}
=\sum_{t=1}^{T}\Delta_t\,\mathbb{I}\{\star_t=a,\,I_t=b\}+\sum_{t=1}^{T}\Delta_t\,\mathbb{I}\{\star_t=b,\,I_t=a\}\,,
\]
with $\Delta_t=\langle X_t,\theta_{\star_t}-\theta_{\lnot\star_t}\rangle$ the instantaneous gap.
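As a sanity check on this identity, one can simulate a two-armed contextual instance and verify that the two expressions for $R_{X,T}$ coincide. The model parameters and the uniformly random policy below are illustrative placeholders, not the CL-BESA sampler.

```python
import numpy as np

# Hedged numerical check (synthetic instance) of the Step 1 identity: with
# K = 2 arms, the contextual regret sum_t <X_t, theta_star_t - theta_It>
# equals the gap-weighted mistake count sum_t Delta_t * 1{I_t = not star_t}.
rng = np.random.default_rng(0)
d, T, sigma_x = 5, 200, 0.3
theta = {"a": rng.normal(size=d), "b": rng.normal(size=d)}
mu = rng.normal(size=d)

lhs = rhs = 0.0
for _ in range(T):
    x = mu + sigma_x * rng.normal(size=d)                   # context X_t
    star = max(("a", "b"), key=lambda arm: x @ theta[arm])  # optimal arm
    other = "b" if star == "a" else "a"
    chosen = str(rng.choice(["a", "b"]))                    # placeholder policy
    lhs += x @ (theta[star] - theta[chosen])
    delta_t = x @ (theta[star] - theta[other])              # instantaneous gap
    rhs += delta_t * (chosen != star)

print(abs(lhs - rhs))  # the two expressions coincide
```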
The event {It= ¬?t} involves〈Xt,θb?t,t−1−θb¬?t,t−1〉, thus the instantaneous gap can be decomposed as
∆t=〈Xt,θ?t−θb?t,t−1〉 + 〈Xt,θb¬?t,t−1−θ¬?t〉
+ 〈Xt,θb?t,t−1−θb¬?t,t−1〉. (7.9) Now, on the event {It= ¬?t}, either〈Xt,θb?t,t−1−θb¬?t,t−1〉 <0, or〈Xt,θb?t,t−1−θb¬?t,t−1〉 = 0 andNt−1(¬?t)<Nt−1(?t), or〈Xt,θb?t,t−1−θb¬?t,t−1〉 =0 andNt−1(¬?t)=Nt−1(?t) and a random coinξt∼B(0.5) is tossed that gets value 1 (without loss of generality).
In any case, it holds that
〈Xt,θb?t,t−1−θb¬?t,t−1〉I{It= ¬?t}É0 .
The parameter $\hat\theta_{\star_t,t-1}=\hat\theta_\lambda(S_{\star_t,t-1}(I^{\star_t}_{t-1}))$ involves the samples $S_{\star_t,t-1}$ and the sub-sampling index set $I^{\star_t}_{t-1}$. For all deterministic $x$ and constant $\delta\in(0,1)$, and for $S=S_{\star_t,t-1}$, it holds by the proof of Theorem 2 from Abbasi-Yadkori et al. (2011) that with probability higher than $1-\delta$ (w.r.t. $S$),
\[
\big|\langle x,\hat\theta_\lambda(S)-\theta_{\star_t}\rangle\big|\le\|x\|_{V_\lambda(S)^{-1}}\,B_{\lambda,\star_t}(S)
\quad\text{where}\quad
B_{\lambda,\star_t}(S)=R_\eta\sqrt{2\log\Big(\frac{\det(V_\lambda(S))}{\lambda^{d/2}\delta}\Big)}+\lambda\|\theta_{\star_t}\|_{V_\lambda(S)^{-1}}\,,
\]
where $R_\eta$ comes from the sub-Gaussian assumption on the noise (7.2), and $V_\lambda(S)=X(S)^\top X(S)+\lambda I_d$. Since $I^{\star_t}_{t-1}$ is chosen independently of $S_{\star_t,t-1}$, the same bound holds for $S_{\star_t,t-1}(I^{\star_t}_{t-1})$, with respect to all sources of randomness. Thus, combining this result together with the decomposition (7.9), and using the assumption that $X_t$ is independent of $\bigcup_{a\in\mathcal{A}}S_{a,t-1}$, one deduces that with probability higher than $1-2\delta$,
\[
\Delta_t\,\mathbb{I}\{I_t=\lnot\star_t\}\le\Big[\sum_{a'\in\{\star_t,\lnot\star_t\}}\|X_t\|_{V_{a',t-1}^{-1}}B_{a',t-1}\Big]\,\mathbb{I}\{I_t=\lnot\star_t\}\,, \quad (7.10)
\]
where the short-hand notations $V_{a,t-1}\stackrel{\text{def}}{=}V_{\lambda_{t-1}}(S_{a,t-1}(I^a_{t-1}))$ as well as $B_{a,t-1}\stackrel{\text{def}}{=}B_{\lambda_{t-1},a}(S_{a,t-1}(I^a_{t-1}))$ are introduced for convenience.
Step 2: Now, the term $\|X_t\|_{V_{a,t-1}^{-1}}$ appearing in Equation (7.10) is bounded. By definition of $V_{a,t-1}$, it holds that
\[
\|X_t\|^2_{V_{a,t-1}^{-1}}=X_t^\top\Big(\lambda_{t-1}I_d+\sum_{i\in I^a_{t-1}}X_{s_i}X_{s_i}^\top\Big)^{-1}X_t\,.
\]
This expression is decomposed by using the definition $X_s=\mu+\xi_s$ for all $s$. Thus, on the one hand,
\[
\|X_t\|_{V_{a,t-1}^{-1}}\le\|\mu\|_{V_{a,t-1}^{-1}}+\|\xi_t\|_2\,\lambda_{t-1}^{-1/2}\,, \quad (7.11)
\]
where one used the fact that the minimum eigenvalue of $V_{a,t-1}$ is lower-bounded by $\lambda_{t-1}$. On the other hand, the following decomposition holds:
\[
V_{a,t-1}=\lambda_{t-1}I_d+|I^a_{t-1}|\,\mu\mu^\top+\sum_{i\in I^a_{t-1}}\xi_{s_i}\xi_{s_i}^\top+\Big(\sum_{i\in I^a_{t-1}}\xi_{s_i}\Big)\mu^\top+\mu\Big(\sum_{i\in I^a_{t-1}}\xi_{s_i}^\top\Big)=V+E+E_1+E_2\,,
\]
where the following four matrices are introduced:
\[
V=\lambda_{t-1}I_d+|I^a_{t-1}|\,\mu\mu^\top\,,\qquad
E=\sum_{i\in I^a_{t-1}}\xi_{s_i}\xi_{s_i}^\top\,,\qquad
E_1=\mu\Big(\sum_{i\in I^a_{t-1}}\xi_{s_i}^\top\Big)
\quad\text{and}\quad
E_2=\Big(\sum_{i\in I^a_{t-1}}\xi_{s_i}\Big)\mu^\top\,.
\]
Now, $\mu$ is an eigenvector of the matrix $V$ with associated eigenvalue $\lambda_\mu\stackrel{\text{def}}{=}\lambda_{t-1}+|I^a_{t-1}|\,\|\mu\|_2^2$. Thus, it holds that $\mu^\top V^{-1}\mu=\mu^\top V^{-1}\frac{V\mu}{\lambda_\mu}=\frac{\|\mu\|_2^2}{\lambda_\mu}$. The minimum eigenvalue of $E$ is non-negative. $\mu$ is also an eigenvector of the rank-1 matrix $E_1$, with eigenvalue $\lambda_{\mu,2}=\sum_{i\in I^a_{t-1}}\langle\xi_{s_i},\mu\rangle$. Finally, the only non-zero eigenvalue of the rank-1 matrix $E_2$ is $\sum_{i\in I^a_{t-1}}\langle\xi_{s_i},\mu\rangle$ (associated to the vector $\sum_{i\in I^a_{t-1}}\xi_{s_i}$).
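The eigenvalue identity for the unperturbed matrix $V$ can be checked numerically; the sketch below uses synthetic values for the dimension, the index-set size, and the regularization, all of which are illustrative assumptions.

```python
import numpy as np

# Minimal numerical sanity check (synthetic values) of the eigenvalue fact used
# in Step 2: mu is an eigenvector of V = lambda*I_d + n*mu mu^T with eigenvalue
# lambda_mu = lambda + n*||mu||_2^2, hence mu^T V^{-1} mu = ||mu||_2^2 / lambda_mu.
rng = np.random.default_rng(1)
d, n, lam = 4, 50, 5.0
mu = rng.normal(size=d)
V = lam * np.eye(d) + n * np.outer(mu, mu)
lam_mu = lam + n * (mu @ mu)

assert np.allclose(V @ mu, lam_mu * mu)                          # eigenvector check
assert np.isclose(mu @ np.linalg.solve(V, mu), (mu @ mu) / lam_mu)
print("eigenvalue identity verified")
```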
Thus one deduces that the matrix norm of the vector $\mu$ cannot increase too much when $V$ is perturbed by $E+E_1+E_2$: $\lambda_\mu$ is shifted by at most $\lambda_{\mu,2}+\min\{\lambda_{\mu,2},0\}$, which leads to the bound
\[
\|\mu\|^2_{V_{a,t-1}^{-1}}\le\frac{\|\mu\|_2^2}{\lambda_{t-1}+|I^a_{t-1}|\,\|\mu\|_2^2+2\min\big\{\sum_{i\in I^a_{t-1}}\langle\xi_{s_i},\mu\rangle,\,0\big\}}\,,
\]
under the condition that $\lambda_{t-1}+|I^a_{t-1}|\,\|\mu\|_2^2+2\min\{\sum_{i\in I^a_{t-1}}\langle\xi_{s_i},\mu\rangle,0\}>0$.
This condition holds with high probability, provided that the noise is small enough. Indeed, by the Chernoff method together with (7.3), it holds for all deterministic sets $I$ of size $n$, and for $\delta\in(0,1)$, that
\[
\mathbb{P}\Big[\sum_{i\in I}\langle\xi_{s_i},\mu\rangle\le-\|\mu\|_2\,\sigma_X\sqrt{2n\log(1/\delta)}\Big]\le\delta\,.
\]
Thus, since $I^a_{t-1}$ is chosen independently of the samples, by a union bound over the possible values of the random size $|I^a_{t-1}|\le t-1$ of the index set, it follows that on an event of probability higher than $1-\delta$,
\[
\sum_{i\in I^a_{t-1}}\langle\xi_{s_i},\mu\rangle\ge-\|\mu\|_2\,\sigma_X\sqrt{2|I^a_{t-1}|\log((t-1)/\delta)}\,.
\]
Thus, solving the condition $n\|\mu\|_2^2+\lambda_{t-1}-2\|\mu\|_2\sigma_X\sqrt{2n\log((t-1)/\delta)}>0$ in $n$, one observes that when $\lambda_{t-1}>2\sigma_X^2\log((t-1)/\delta)$, the condition is satisfied for all $n$.
Step 3: Plugging this result and the bound on $\|\mu\|^2_{V_{a,t-1}^{-1}}$ into (7.11), and combining this together with (7.10), one deduces that at a time $t$ such that $I_t=\lnot\star_t$, with probability higher than $1-6\delta$,
\[
\Delta_t\le\sqrt{\frac{\|\mu\|_2^2\,B^2_{\lnot\star_t,t-1}}{\lambda_{t-1}+n_{t-1}\|\mu\|_2^2-2\|\mu\|_2\sigma_X\sqrt{2n_{t-1}\log(\frac{t-1}{\delta})}}}
+\sqrt{\frac{\|\mu\|_2^2\,B^2_{\star_t,t-1}}{\lambda_{t-1}+n_{t-1}\|\mu\|_2^2-2\|\mu\|_2\sigma_X\sqrt{2n_{t-1}\log(\frac{t-1}{\delta})}}}
+\|\xi_t\|_2\,\lambda_{t-1}^{-1/2}\big(B_{\lnot\star_t,t-1}+B_{\star_t,t-1}\big)\,, \quad (7.12)
\]
where $n_{t-1}\stackrel{\text{def}}{=}|I^{\lnot\star_t}_{t-1}|=|I^{\star_t}_{t-1}|=\min\{N_{t-1}(\lnot\star_t),N_{t-1}(\star_t)\}$.
The next step is to simplify this expression by upper bounding both $B_{\lnot\star_t,t-1}$ and $B_{\star_t,t-1}$. To this end, one notes that on the one hand $\lambda_{t-1}\|\theta_{\star_t}\|_{V_{\lnot\star_t,t-1}^{-1}}\le\lambda_{t-1}^{1/2}\|\theta_{\star_t}\|_2$, and on the other hand, using the fact that $\|X_t\|_2\le C$ for all context vectors $X_t$, where $C\stackrel{\text{def}}{=}\|\mu\|_2+\sigma_X$,
\[
\det(V_{\lnot\star_t,t-1})\le\Big(\frac{\operatorname{trace}(V_{\lnot\star_t,t-1})}{d}\Big)^d\le\Big(\frac{\lambda_{t-1}d+n_{t-1}C^2}{d}\Big)^d\le\big(\lambda_{t-1}+(t-1)C^2/d\big)^d\,.
\]
Thus, it holds for both $I'=\lnot\star_t$ and $I'=\star_t$ that
\[
B_{I',t-1}\le R_\eta\sqrt{2d\log\Big(\frac{\lambda_{t-1}^{1/2}+(t-1)C^2/(d\lambda_{t-1}^{1/2})}{\delta}\Big)}+\lambda_{t-1}^{1/2}\|\theta_{I'}\|_2\,. \quad (7.13)
\]
For convenience, the first term on the right-hand side of (7.13) is denoted
\[
b_{t-1}=R_\eta\sqrt{2d\log\Big(\frac{\lambda_{t-1}^{1/2}+(t-1)C^2/(d\lambda_{t-1}^{1/2})}{\delta}\Big)}\,.
\]
Combining (7.13) together with (7.12), and using the fact that $\|\theta\|_2\le B$ for all $\theta\in\Theta$, it has been shown so far that for all $t$ such that $I_t=\lnot\star_t$, with probability higher than $1-6\delta$,
\[
\Delta_t\le\bigg[\frac{2\|\mu\|_2}{\sqrt{\lambda_{t-1}+n_{t-1}\|\mu\|_2^2-2\|\mu\|_2\sigma_X\sqrt{2n_{t-1}\log(\frac{t-1}{\delta})}}}+\|\xi_t\|_2\,\lambda_{t-1}^{-1/2}\bigg]\Big[b_{t-1}+\lambda_{t-1}^{1/2}B\Big]\,,
\]
that is, after reorganizing the terms, and provided that the noise is not too large, i.e. $\Delta_t\ge\|\xi_t\|_2(b_{t-1}\lambda_{t-1}^{-1/2}+B)$,
\[
n_{t-1}-\frac{2\sigma_X}{\|\mu\|_2}\sqrt{2n_{t-1}\log\Big(\frac{t-1}{\delta}\Big)}\le 4\bigg(\frac{b_{t-1}+\lambda_{t-1}^{1/2}B}{\Delta_t-\|\xi_t\|_2(b_{t-1}\lambda_{t-1}^{-1/2}+B)}\bigg)^2-\frac{\lambda_{t-1}}{\|\mu\|_2^2}\,. \quad (7.14)
\]
By introducing the threshold $\tau_t=\|\xi_t\|_2(b_{t-1}\lambda_{t-1}^{-1/2}+B)$, the regret is decomposed into:
\[
R_{X,T}\le\sum_{t=1}^{T}\Delta_t\,\mathbb{I}\{I_t=\lnot\star_t,\ \Delta_t\ge\tau_t\}+\sum_{t=1}^{T}\Delta_t\,\mathbb{I}\{\Delta_t<\tau_t\}\,.
\]
Step 4: In this step, two cases are considered: first, the case when $N_{t-1}(\lnot\star_t)\ge N_{t-1}(\star_t)$; second, the case when $N_{t-1}(\lnot\star_t)<N_{t-1}(\star_t)$.

In the first situation, $n_{t-1}=N_{t-1}(\star_t)$, and it follows from (7.14) that $N_{t-1}(\star_t)$ cannot be too large. Indeed, for positive constants $A,B_0$, the condition $n-A\sqrt{n}\le B_0$ implies that $n\le B_0+\frac{A^2}{2}\big(1+\sqrt{1+4B_0/A^2}\big)$, which in this case leads to $N_{t-1}(\star_t)\le u_{\lambda,t}(\delta)$, where
\[
u_{\lambda,t}(\delta)\stackrel{\text{def}}{=}4\bigg(\frac{b_{t-1}+\lambda_{t-1}^{1/2}B}{\Delta_t-\|\xi_t\|_2(b_{t-1}\lambda_{t-1}^{-1/2}+B)}\bigg)^2+\frac{4\sigma_X^2\log(\frac{t-1}{\delta})}{\|\mu\|_2^2}+\frac{4\sigma_X\sqrt{2\log(\frac{t-1}{\delta})}}{\|\mu\|_2}\bigg(\frac{b_{t-1}+\lambda_{t-1}^{1/2}B}{\Delta_t-\|\xi_t\|_2(b_{t-1}\lambda_{t-1}^{-1/2}+B)}\bigg)-\frac{\lambda_{t-1}}{\|\mu\|_2^2}\,.
\]
Likewise, in the second case, when $N_{t-1}(\lnot\star_t)<N_{t-1}(\star_t)$, one deduces that necessarily $N_{t-1}(\lnot\star_t)\le u_{\lambda,t}(\delta)$ with high probability, and thus the regret is again controlled.
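The elementary inversion used above (for positive constants, $n-A\sqrt{n}\le B_0$ forces $n$ below an explicit threshold) can be checked numerically; the constants in the sketch are illustrative.

```python
import math

# Numerical sanity check (illustrative constants) of the elementary inversion
# used in Step 4: for positive A, B0, the condition n - A*sqrt(n) <= B0 implies
# n <= B0 + (A**2/2) * (1 + sqrt(1 + 4*B0/A**2)), and this bound is tight.
def upper_bound(A, B0):
    return B0 + (A ** 2 / 2) * (1 + math.sqrt(1 + 4 * B0 / A ** 2))

for A in (0.5, 2.0, 10.0):
    for B0 in (0.1, 1.0, 25.0):
        u = upper_bound(A, B0)
        assert u - A * math.sqrt(u) <= B0 + 1e-9   # u itself satisfies the premise
        n_bad = u + 1e-6                            # just above u the premise fails
        assert n_bad - A * math.sqrt(n_bad) > B0
print("inversion bound verified")
```

The bound is exactly the largest root of $n-A\sqrt{n}=B_0$, which is why the check just above the threshold fails as expected.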
From this point on, one can proceed similarly to the elementary proofs of the regret of the UCB algorithm (Auer et al., 2002), as in (Maillard et al., 2011; Bubeck, 2010), or of BESA (Baransi et al., 2014). More precisely, since $|\Delta_t|\le 1$ by assumption, it holds that
\[
R_{X,T}\le\sum_{t=1}^{T}\Delta_t\,\mathbb{I}\{I_t=\lnot\star_t\cap\Delta_t\ge\tau_t\cap N_{t-1}(\star_t)>u_{\lambda,t}(\delta_t)\}+\sum_{t=1}^{T}\mathbb{I}\{N_{t-1}(\star_t)\le u_{\lambda,t}(\delta_t)\}+\sum_{t=1}^{T}\Delta_t\,\mathbb{I}\{\Delta_t<\tau_t\}
\]
\[
\le\sum_{t=1}^{T}\Delta_t\,\mathbb{I}\{I_t=\lnot\star_t\cap\Delta_t\ge\tau_t\cap N_{t-1}(\lnot\star_t)\le u_{\lambda,t}(\delta_t)\}+6\sum_{t=1}^{T}\delta_t+\sum_{t=1}^{T}\mathbb{I}\{N_{t-1}(\star_t)\le u_{\lambda,t}(\delta_t)\}+\sum_{t=1}^{T}\Delta_t\,\mathbb{I}\{\Delta_t<\tau_t\}\,,
\]
for any choice of $\delta_t\in(0,1)$ for $t\in\{1,\dots,T\}$. In particular, this holds for the choice $\delta_t=t^{-2}$.
The first sum is split into the sum over the time steps for which $\star_t=a$ and the sum over those for which $\star_t=b$. Note that $\Delta_t$ is not independent of $I_t$. For the sum over the steps such that $\star_t=b$, it holds that
\[
\sum_{t=1}^{T}\Delta_t\,\mathbb{I}\{I_t=a\cap N_{t-1}(a)\le u_{\lambda,t}(\delta_t)\}\,\mathbb{I}\{\star_t=b\}
\le\big(\max_{t\le T}\Delta_t\big)\sum_{t=1}^{T}\mathbb{I}\{I_t=a\cap N_{t-1}(a)\le\max_{t\le T}u_{\lambda,t}(\delta_t)\}\,\mathbb{I}\{\star_t=b\}
\le\big(\max_{t\le T}\Delta_t\big)\big(\max_{t\le T}u_{\lambda,t}(\delta_t)\big)\,.
\]
Likewise, a similar control can be obtained for the sum corresponding to $\star_t=a$.
In order to control the maximum terms, one uses the fact that $\|\xi_t\|_2^2\le\sigma_X^2$. Thus, it holds that $\max_{t\le T}u_{\lambda,t}(\delta_t)\le u$, where
\[
u\stackrel{\text{def}}{=}16\Bigg[\frac{R_\eta\sqrt{2d\log\Big(\frac{\lambda_T^{1/2}T^2+T^3C^2}{d\lambda_T^{1/2}}\Big)}+\lambda_T^{1/2}B}{\min_{t\in[T]}\Delta_t}+\frac{\sigma_X\sqrt{6\log(T)}}{\|\mu\|_2}\Bigg]^2-\frac{\lambda_T}{\|\mu\|_2^2}\,,
\]
provided that the context noise is small enough that
\[
\min_{t\in[T]}\Delta_t>\tau\stackrel{\text{def}}{=}2\sigma_X\bigg[R_\eta\sqrt{\frac{2d}{\lambda_T}\log\Big(\frac{\lambda_T^{1/2}T^2+T^3C^2}{d\lambda_T^{1/2}}\Big)}+B\bigg]\,.
\]
To sum up this step, it has been shown so far that $R_{X,T}$ is bounded as
\[
R_{X,T}\le 2\big(\max_{t\le T}\Delta_t\big)u+\pi^2+\sum_{t=1}^{T}\mathbb{I}\{N_{t-1}(\star_t)\le u_{\lambda,t}(t^{-2})\}+\sum_{t=1}^{T}\Delta_t\,\mathbb{I}\Big\{\min_{t\in[T]}\Delta_t\le\tau\Big\}\,.
\]
Step 5: In order to conclude this proof, the term that now needs to be controlled is
\[
\sum_{t=1}^{T}\mathbb{I}\{N_{t-1}(\star_t)\le u_{\lambda,t}(\delta_t)\}=\sum_{t\in\mathcal{T}_a}\mathbb{I}\{N_{t-1}(a)\le u_{\lambda,t}(\delta_t)\}+\sum_{t\in\mathcal{T}_b}\mathbb{I}\{N_{t-1}(b)\le u_{\lambda,t}(\delta_t)\}\,,
\]
where $\mathcal{T}_a\stackrel{\text{def}}{=}\{t\in[T]:\star_t=a\}$ for $a\in\mathcal{A}$.
To this end, a procedure similar to that used in (Baransi et al., 2014) is employed. More precisely, following the exact same steps as Steps 3 and 4 of the proof of Theorem 1 in (Baransi et al., 2014), it follows that
\[
\sum_{t\in\mathcal{T}_b}\mathbb{P}\big[N_{t-1}(b)\le u_{\lambda,t}(\delta_t)\big]\le c+\sum_{t\in\mathcal{T}_b,\,t\ge c}\ \sum_{j=1}^{\lfloor u_{\lambda,t}(\delta_t)\rfloor}\alpha_{b,a}(M_t,j)+O(1)\,,
\]
where $c$ is a constant such that $t\ge c$ implies $t\ge u_{\lambda,t}(\delta_t)(u_{\lambda,t}(\delta_t)+1)$, $M_t\in\mathbb{N}$ is such that $M_t=O(\log(t))$, and where the function $\alpha_{b,a}(M,j)$ is defined by¹
\[
\alpha_{b,a}(M,j)=\mathbb{E}_{Z^b_{1:j}}\Big[\mathbb{P}_{Z^a_{1:j},X}\Big(\langle X,\hat\theta(Z^a_{1:j})-\hat\theta(Z^b_{1:j})\rangle>0\Big)^M\Big]\,.
\]
Here, one used explicitly the stochastic nature of the context $X_t$, to avoid having to deal with much more complex expressions. This comes at the price of restricting to cases when the noise is not too strong. Here, $Z^b_{1:j}$ denotes a set of $j$ i.i.d. samples $Z^b_j=(X_j,Y_j)$ generated from the model considered in the introduction when arm $b$ is chosen (that is, such that $Y_j=\langle X_j,\theta_b\rangle+\eta_j$ and $X_j=\mu+\xi_j$), $Z^a_{1:j}$ denotes a similar set built using $\theta_a$ instead of $\theta_b$, and $X=\mu+\xi$ is generated by Nature.

¹Indeed, the tie event $\langle X,\hat\theta(Z^a_{1:j})-\hat\theta(Z^b_{1:j})\rangle=0$ has probability 0, since the distributions are diffuse.
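The quantity $\alpha_{b,a}(M,j)$ can be approximated by a simple nested Monte-Carlo loop: an outer expectation over $Z^b_{1:j}$ and an inner probability over fresh $Z^a_{1:j}$ and a fresh context $X$. The sketch below uses a plain ridge estimator as a stand-in for $\hat\theta_\lambda$, and all parameters (dimension, gap, noise levels, sample sizes) are illustrative assumptions, not the CL-BESA pipeline.

```python
import numpy as np

# Hedged Monte-Carlo sketch of alpha_{b,a}(M, j): an outer expectation over
# Z^b_{1:j}, an inner probability over fresh Z^a_{1:j} and a fresh context X.
# A plain ridge estimator stands in for theta_hat; all parameters are
# illustrative assumptions.
rng = np.random.default_rng(2)
d, j, M, lam, sigma_x, R = 3, 10, 5, 1.0, 0.2, 0.1
mu = np.ones(d)
theta_a, theta_b = np.zeros(d), 0.5 * np.ones(d)   # arm b is the better arm

def theta_hat(theta, n):
    """Ridge estimate from n fresh samples of the arm with parameter theta."""
    X = mu + sigma_x * rng.normal(size=(n, d))
    y = X @ theta + R * rng.normal(size=n)
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def alpha(n_outer=100, n_inner=100):
    total = 0.0
    for _ in range(n_outer):
        tb = theta_hat(theta_b, j)                 # Z^b_{1:j} held fixed inside
        wins = sum(
            (mu + sigma_x * rng.normal(size=d)) @ (theta_hat(theta_a, j) - tb) > 0
            for _ in range(n_inner)
        )
        total += (wins / n_inner) ** M
    return total / n_outer

est = alpha()
print(est)  # expected to be small: arm a rarely beats arm b M times in a row
```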
Step 6: The next step of the proof is to control the quantity $\alpha_{b,a}(M,j)$ for the steps in $\mathcal{T}_b$ (and likewise $\alpha_{a,b}(M,j)$ for the steps in $\mathcal{T}_a$). In the sequel, the notation $b=\star$ is used to clarify that $b$ is the optimal arm at the time steps in $\mathcal{T}_b$. One wants to show that this quantity decays exponentially fast to 0 with either $M$ or $j$, so that its contribution to the regret is controlled. To begin with, the dot product is decomposed according to the different random variables:
\[
\langle X,\hat\theta(Z^a_{1:j})-\hat\theta(Z^\star_{1:j})\rangle=\langle\mu,\theta_\star-\hat\theta(Z^\star_{1:j})\rangle+\langle\mu,\hat\theta(Z^a_{1:j})-\theta_a\rangle+\langle\xi,\hat\theta(Z^a_{1:j})-\theta_a\rangle+\langle\xi,\theta_\star-\hat\theta(Z^\star_{1:j})\rangle+\langle\xi,\theta_a-\theta_\star\rangle-\Delta\,,
\]
where $\Delta=\langle\mu,\theta_\star-\theta_a\rangle$. Then, it holds for all $\varepsilon>0$ that
\[
\mathbb{P}_X\Big(\langle\xi,\theta_a-\theta_\star\rangle\ge\varepsilon\Big)\le\exp\Big(-\frac{\varepsilon^2}{2\|\theta_a-\theta_\star\|_2^2\,\sigma_X^2}\Big)\,. \quad (7.15)
\]
The term $\langle\xi,\theta_\star-\hat\theta(Z^\star_{1:j})\rangle$ is controlled a bit differently, by
\[
\mathbb{P}_X\Big(\langle\xi,\theta_\star-\hat\theta(Z^\star_{1:j})\rangle\ge\varepsilon\Big)\le\exp\Big(-\frac{\varepsilon^2\lambda_{t_j}}{2\|\theta_\star-\hat\theta(Z^\star_{1:j})\|^2_{V_{\star,t_j}}\sigma_X^2}\Big)\,, \quad (7.16)
\]
where $t_j\ge j$ corresponds to a time when arm $a$ is sampled at least $j$ times (note that this is a probability with respect to $X$, not $Z^\star_{1:j}$).
Likewise, $\langle\xi,\hat\theta(Z^a_{1:j})-\theta_a\rangle$ is controlled by
\[
\mathbb{P}_{Z^a_{1:j},X}\Big(\langle\xi,\hat\theta(Z^a_{1:j})-\theta_a\rangle\ge\varepsilon\Big)\le\Big(\lambda_{t_j}^{1/2}+\frac{jC^2}{d\lambda_{t_j}^{1/2}}\Big)\exp\Big(-\frac{\lambda_{t_j}(\varepsilon-\sigma_X\|\theta_\star\|_2)^2}{2\sigma_X^2R_\eta^2}\Big)\,, \quad (7.17)
\]
for all $\varepsilon>\sigma_X\|\theta_\star\|_2$.
Finally, it has already been shown that, with probability higher than $1-\delta-\delta'$ with respect to $Z^a_{1:j}$,
\[
\langle\mu,\theta_a-\hat\theta(Z^a_{1:j})\rangle\le\frac{\|\mu\|_2R_\eta\sqrt{2\log\Big(\frac{\lambda_{t_j}^{1/2}+jC^2/(d\lambda_{t_j}^{1/2})}{\delta}\Big)}+\lambda_{t_j}^{1/2}\|\mu\|_2\|\theta_a\|_2}{\sqrt{\lambda_{t_j}+j\|\mu\|_2^2-2\|\mu\|_2\sigma_X\sqrt{2j\log(1/\delta')}}}\,. \quad (7.18)
\]
Inverting this bound in $\delta$, this gives, for all $\varepsilon\ge\frac{\|\mu\|_2}{\sqrt{j\|\mu\|_2^2+\lambda_{t_j}}}$ and $\varepsilon'\ge\lambda_{t_j}^{1/2}\|\theta_a\|_2$, that
\[
\mathbb{P}_{Z^a_{1:j}}\Big(\langle\mu,\theta_a-\hat\theta(Z^a_{1:j})\rangle\ge\varepsilon\varepsilon'\Big)\le\exp\Big(-\frac{\big(j\|\mu\|_2^2+\lambda_{t_j}-\|\mu\|_2^2/\varepsilon^2\big)^2}{8\|\mu\|_2^2\sigma_X^2 j}\Big)+\Big(\lambda_{t_j}^{1/2}+\frac{jC^2}{d\lambda_{t_j}^{1/2}}\Big)\exp\Big(-\frac{\big(\varepsilon'-\lambda_{t_j}^{1/2}\|\theta_a\|_2\big)^2}{2R_\eta^2}\Big)\,. \quad (7.19)
\]
Thus, by combining equations (7.15), (7.16), (7.17) and (7.19) together, it follows that
\[
\mathbb{P}_{Z^a_{1:j},X}\Big(\langle X,\hat\theta(Z^a_{1:j})-\hat\theta(Z^\star_{1:j})\rangle>0\Big)\le\mathbb{P}_{Z^a_{1:j},X}\Big(\langle\mu,\hat\theta(Z^a_{1:j})-\theta_a\rangle+\langle\xi,\hat\theta(Z^a_{1:j})-\theta_a\rangle+\langle\xi,\theta_\star-\hat\theta(Z^\star_{1:j})\rangle+\langle\xi,\theta_a-\theta_\star\rangle>\Delta-\langle\mu,\theta_\star-\hat\theta(Z^\star_{1:j})\rangle\Big)
\]
\[
\le\inf_{\varepsilon_0,\dots,\varepsilon_4}\bigg\{\exp\Big(-\frac{\big(j\|\mu\|_2^2+\lambda_{t_j}-\|\mu\|_2^2/\varepsilon_0^2\big)^2}{8\|\mu\|_2^2\sigma_X^2 j}\Big)+\Big(\lambda_{t_j}^{1/2}+\frac{jC^2}{d\lambda_{t_j}^{1/2}}\Big)\exp\Big(-\frac{\big(\varepsilon_1-\lambda_{t_j}^{1/2}\|\theta_a\|_2\big)^2}{2R_\eta^2}\Big)+\Big(\lambda_{t_j}^{1/2}+\frac{jC^2}{d\lambda_{t_j}^{1/2}}\Big)\exp\Big(-\frac{\lambda_{t_j}(\varepsilon_2-\sigma_X\|\theta_\star\|_2)^2}{2\sigma_X^2R_\eta^2}\Big)+\exp\Big(-\frac{\varepsilon_3^2\lambda_{t_j}}{2\|\theta_\star-\hat\theta(Z^\star_{1:j})\|^2_{V_{\star,t_j}}\sigma_X^2}\Big)+\exp\Big(-\frac{\varepsilon_4^2}{2\|\theta_a-\theta_\star\|_2^2\sigma_X^2}\Big)
\]
\[
:\ \varepsilon_0\varepsilon_1+\varepsilon_2+\varepsilon_3+\varepsilon_4\le\Delta-\langle\mu,\theta_\star-\hat\theta(Z^\star_{1:j})\rangle,\ \varepsilon_2>\sigma_X\|\theta_\star\|_2,\ \varepsilon_0\ge\frac{\|\mu\|_2}{\sqrt{j\|\mu\|_2^2+\lambda_{t_j}}},\ \varepsilon_1\ge\lambda_{t_j}^{1/2}\|\theta_a\|_2\bigg\}\,.
\]
By choosing $\varepsilon_0=\frac{\sqrt{2}\,\|\mu\|_2}{\sqrt{j\|\mu\|_2^2+\lambda_{t_j}}}$, $\varepsilon_1=2\lambda_{t_j}^{1/2}\|\theta_a\|_2$, $\varepsilon_2=2\sigma_X\|\theta_\star\|_2$, $\varepsilon_3=\sqrt{2}\,\sigma_X\|\theta_\star\|_2$, and $\varepsilon_4=\sqrt{2}\,\sigma_X\|\theta_a-\theta_\star\|_2\,\kappa$ for some positive $\kappa$, the previous expression can be simplified. This way, and using the fact that $\lambda_T\ge\lambda_{t_j}\ge\lambda_j$, one finally establishes the following bound
on the quantity $\alpha_{\star,a}(M_t,j)$ that has to be controlled:
\[
\alpha_{\star,a}(M_t,j)\le\mathbb{E}_{Z^\star_{1:j}}\Bigg[\bigg(e^{-\kappa^2}+e^{-\frac{(j\|\mu\|_2^2+\lambda_j)^2}{32\|\mu\|_2^2\sigma_X^2 j}}+\Big(\lambda_T^{1/2}+\frac{jC^2}{d\lambda_j^{1/2}}\Big)\Big[e^{-\frac{\lambda_j\|\theta_a\|_2^2}{2R_\eta^2}}+e^{-\frac{\lambda_j\|\theta_\star\|_2^2}{2R_\eta^2}}\Big]+e^{-\frac{\lambda_j\|\theta_\star\|_2^2}{\|\theta_\star-\hat\theta(Z^\star_{1:j})\|^2_{V_{\star,t_j}}}}\bigg)^{M_t}\mathbb{I}\{\mathcal{E}\}+\mathbb{I}\{\mathcal{E}^c\}\Bigg]\,. \quad (7.20)
\]
For convenience, the event $\mathcal{E}$ has been introduced:
\[
\mathcal{E}\stackrel{\text{def}}{=}\bigg\{\langle\mu,\theta_\star-\hat\theta(Z^\star_{1:j})\rangle\le\Delta-\frac{2\sqrt{2}\,\|\mu\|_2\|\theta_a\|_2}{\sqrt{j\|\mu\|_2^2/\lambda_{t_j}+1}}-\sqrt{2}\,\sigma_X B\big(1+\sqrt{2}+2\kappa\big)\bigg\}\,.
\]
At this point, one notes that since $\lambda_j\ge 6\sigma_X^2\log(j)$, all the exponential terms but $e^{-\kappa^2}$ decay polynomially fast to 0 with $j$, and that for $M_t=O(\log(t))$, the first half of the bound on $\alpha_{\star,a}(M_t,j)$ decays polynomially fast to 0 with $t$, at a rate that can always be adjusted to be $t^{-(1+\beta)}$ for some small $\beta>0$. Thus, in order to control the regret term of Step 5, one deduces from this observation and (7.20) that it only remains to show that $\mathbb{P}_{Z^\star_{1:j}}(\mathcal{E}^c)$ is small enough under our assumptions on the noise. From equation (7.19), it is not difficult to see that
\[
\mathbb{P}_{Z^\star_{1:j}}(\mathcal{E}^c)\le e^{-\frac{(j\|\mu\|_2^2+\lambda_j)^2}{32\|\mu\|_2^2\sigma_X^2 j}}+\Big(\lambda_T^{1/2}+\frac{jC^2}{d\lambda_j^{1/2}}\Big)e^{-\frac{\lambda_j\|\theta_\star\|_2^2}{2R_\eta^2}}\,,
\]
provided that the following condition holds:
\[
\frac{2\sqrt{2}\big(\|\theta_a\|_2+\|\theta_\star\|_2\big)}{\sqrt{j/\lambda_T+\|\mu\|_2^{-2}}}\le\Delta-\sqrt{2}\,\sigma_X B\big(1+\sqrt{2}+2\kappa\big)\,.
\]
Finally, in order for the term $\mathbb{P}_{Z^\star_{1:j}}(\mathcal{E}^c)$ to decay fast enough with $t$ (at least as $t^{-(1+\beta)}$), it is enough to choose $\lambda_t\ge c\log(T)$ for all $t$, for some constant $c$, which leads to
\[
\sum_{t\in\mathcal{T}_b,\,t\ge c_\Delta}\ \sum_{j=1}^{\lfloor u_{\lambda,t}(\delta_t)\rfloor}\alpha_{b,a}(M_t,j)=O(1)\,.
\]
Step 7: Now that $\alpha_{\star,a}(M_t,j)$ is controlled, $\alpha_{a,b}(M_t,j)$ can be treated similarly. Thus, at the price of losing a factor 2 (using $|\mathcal{T}_a|\le T$ and $|\mathcal{T}_b|\le T$), it has been shown that
\[
\mathbb{E}[R_{X,T}]\le\big(\max_{t\in[T]}\Delta_t\big)U+\sum_{t=1}^{T}\Delta_t\,\mathbb{I}\Big\{\min_{t\in[T]}\Delta_t\le\tau\Big\}+O(1)\,,
\]
where
\[
U\stackrel{\text{def}}{=}\frac{64}{\min_{t\in[T]}\Delta_t^2}\bigg[R_\eta\sqrt{2d\log\Big(\frac{\lambda_T^{1/2}T^2+T^3C^2}{d\lambda_T^{1/2}}\Big)}+\lambda_T^{1/2}B\bigg]^2+\frac{24\sigma_X^2\log(T)-2\lambda_T}{\|\mu\|_2^2}
\]
and
\[
\tau\stackrel{\text{def}}{=}2\sigma_X\bigg[R_\eta\sqrt{\frac{2d}{\lambda_T}\log\Big(\frac{\lambda_T^{1/2}T^2+T^3C^2}{d\lambda_T^{1/2}}\Big)}+B\bigg]\,.
\]
This concludes the proof, after some cosmetic simplifications of $U$ using $(a+b)^2\le 2(a^2+b^2)$ for positive $a,b$.