
7.3 Contextual regret bound

7.3.2 Theoretical bound

Then, the contextual regret of CL-BESA after $T$ rounds is upper bounded by

$$
\mathbb{E}[R_{X,T}] \;\le\; \big(\max_{t\in[T]}\Delta_t\big)\,\frac{64}{\min_{t\in[T]}\Delta_t^2}\Bigg[R_\eta\sqrt{2d\log\Big(\lambda^{1/2}T^2+\frac{T^3(\|\mu\|_2+\sigma_X)^2}{d\,\lambda^{1/2}}\Big)}+\lambda^{1/2}B\Bigg]^2
\;+\;\big(\max_{t\in[T]}\Delta_t\big)\,\frac{24\sigma_X^2\log(T)-2\lambda}{\|\mu\|_2^2}
\;+\;\sum_{t=1}^T \Delta_t\,\mathbb{I}\Big\{\min_{t\in[T]}\Delta_t\le\tau\Big\}\;+\;O(1)\,,
$$

where the expectation is with respect to the internal randomness of the algorithm and of the additive reward noise, and where

$$
\tau \;\overset{\text{def}}{=}\; 2\sigma_X\Bigg[R_\eta\sqrt{\frac{2d}{\lambda}\log\Big(\lambda^{1/2}T^2+\frac{T^3(\|\mu\|_2+\sigma_X)^2}{d\,\lambda^{1/2}}\Big)}+B\Bigg]\,. \tag{7.8}
$$

When the context perturbation is $\sigma_X=0$, then $\Delta_t$ reduces to $\Delta\overset{\text{def}}{=}\big|\langle\mu,\theta_a-\theta_b\rangle\big|$ and we obtain

$$
\mathbb{E}[R_{X,T}] \;\le\; \frac{128\,R_\eta^2\,d}{\Delta}\,\log\big(2T^3\|\mu\|_2^2/d\big)+O(1)\,.
$$
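For a quick sense of orders of magnitude, the leading term of this noiseless-context bound is easy to evaluate; the following Python snippet is our own illustrative helper, not part of the analysis.

```python
import numpy as np

def regret_bound_sigma0(T, d, Delta, R_eta, mu_norm):
    """Leading term of the bound when sigma_X = 0:
    128 * R_eta^2 * d / Delta * log(2 T^3 ||mu||_2^2 / d)."""
    return 128 * R_eta**2 * d / Delta * np.log(2 * T**3 * mu_norm**2 / d)

# Example: the bound grows as d * log(T), e.g.
print(regret_bound_sigma0(T=10_000, d=10, Delta=0.5, R_eta=1.0, mu_norm=1.0))
```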

Algorithm                                      Regret bound
CL-BESA                                        $O(d \log(T^3))$
OFUL (Abbasi-Yadkori et al., 2011)             $O(d \log^2(T))$
Confidence Ball (Dani et al., 2008)            $O(d^2 \log^3(T))$
Thompson sampling (Agrawal and Goyal, 2012a)   $O(d^2 \sqrt{T})$

Table 7.1: Theoretical regret bounds for contextual bandit algorithms

Remark 7.3.1. For the sake of comparison, Table 7.1 lists the theoretical regret bounds associated with state-of-the-art algorithms, showing the merits of CL-BESA: the associated regret scales linearly with the dimension $d$, and logarithmically with the time horizon $T$. This result establishes the applicability of the sub-sampling technique to the contextual multi-armed bandit problem.

Remark 7.3.2. The restriction regards the minimum gap $\min_{t\in[T]}\Delta_t$ (see Eq. 7.8). Note that a similar limitation is encountered in the distribution-dependent analysis of OFUL. Additionally, the authors explicitly assume a constant optimal arm in the distribution-dependent proof.

Remark 7.3.3. The assumption $\lambda\ge 6\sigma_X^2\log(T)$ is mainly formulated for technical reasons. Also, experiments (see Section 7.4) suggest that Equations 7.7 and 7.8 might be improved further. In practice, $\lambda=\lambda_t=\Omega(\sigma_X^2)\log(t)$ is recommended, even though a good robustness with respect to the choice of $\lambda$ is observed in the experiments.
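A minimal sketch of this recommended schedule (assuming the learner knows, or estimates, the context-noise level $\sigma_X$; the constant 6 mirrors the technical assumption above and is otherwise arbitrary):

```python
import numpy as np

def lambda_schedule(t, sigma_X, c=6.0):
    """Regularization lambda_t = c * sigma_X^2 * log(t), i.e. Omega(sigma_X^2) log(t).
    c = 6 matches the technical assumption lambda >= 6 sigma_X^2 log(T)."""
    return c * sigma_X**2 * np.log(max(t, 2))
```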


Proof. Step 1: Let $\star_t$ be the optimal action at time $t$ and $\neg\star_t$ the other action (recall that there are only $K=2$ arms, $\mathcal{A}=\{a,b\}$).

By definition of the contextual regret at time $T$,

$$
R_{X,T} \;=\; \sum_{t=1}^T \langle X_t,\theta_{\star_t}-\theta_{I_t}\rangle
\;=\; \sum_{t=1}^T \langle X_t,\theta_{\star_t}-\theta_{\neg\star_t}\rangle\,\mathbb{I}\{I_t=\neg\star_t\}
\;=\; \sum_{t=1}^T \Delta_t\,\mathbb{I}\{\star_t=a,\,I_t=b\} \;+\; \sum_{t=1}^T \Delta_t\,\mathbb{I}\{\star_t=b,\,I_t=a\}\,,
$$

with $\Delta_t=\langle X_t,\theta_{\star_t}-\theta_{\neg\star_t}\rangle$ the instantaneous gap.

The event $\{I_t=\neg\star_t\}$ involves $\langle X_t,\hat\theta_{\star_t,t-1}-\hat\theta_{\neg\star_t,t-1}\rangle$, thus the instantaneous gap can be decomposed as

$$
\Delta_t \;=\; \langle X_t,\theta_{\star_t}-\hat\theta_{\star_t,t-1}\rangle \;+\; \langle X_t,\hat\theta_{\neg\star_t,t-1}-\theta_{\neg\star_t}\rangle \;+\; \langle X_t,\hat\theta_{\star_t,t-1}-\hat\theta_{\neg\star_t,t-1}\rangle\,. \tag{7.9}
$$

Now, on the event $\{I_t=\neg\star_t\}$, either $\langle X_t,\hat\theta_{\star_t,t-1}-\hat\theta_{\neg\star_t,t-1}\rangle<0$, or $\langle X_t,\hat\theta_{\star_t,t-1}-\hat\theta_{\neg\star_t,t-1}\rangle=0$ and $N_{t-1}(\neg\star_t)<N_{t-1}(\star_t)$, or $\langle X_t,\hat\theta_{\star_t,t-1}-\hat\theta_{\neg\star_t,t-1}\rangle=0$ and $N_{t-1}(\neg\star_t)=N_{t-1}(\star_t)$ and a random coin $\xi_t\sim\mathcal{B}(0.5)$ is tossed that takes value 1 (without loss of generality). In any case, it holds that

$$
\langle X_t,\hat\theta_{\star_t,t-1}-\hat\theta_{\neg\star_t,t-1}\rangle\,\mathbb{I}\{I_t=\neg\star_t\}\;\le\;0\,.
$$
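The case analysis above is simply the decision rule of the algorithm written out. As a reference point, here is a minimal Python sketch of the comparison step being analyzed, assuming CL-BESA sub-samples the longer history down to the size of the shorter one and compares ridge estimates; function and variable names are ours, not from the original text.

```python
import numpy as np

def ridge_estimate(X, y, lam):
    """Regularized least squares: theta_hat = (X^T X + lam I)^(-1) X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def clbesa_choose(x_t, S_a, S_b, lam, rng):
    """One round of the two-arm comparison the proof analyzes (sketch).
    S_a, S_b: (X, y) histories; the longer one is sub-sampled to the shorter's size."""
    (Xa, ya), (Xb, yb) = S_a, S_b
    n = min(len(ya), len(yb))                        # n_{t-1} = min(N_{t-1}(a), N_{t-1}(b))
    Ia = rng.choice(len(ya), size=n, replace=False)  # sub-sampling index set for arm a
    Ib = rng.choice(len(yb), size=n, replace=False)  # sub-sampling index set for arm b
    theta_a = ridge_estimate(Xa[Ia], ya[Ia], lam)
    theta_b = ridge_estimate(Xb[Ib], yb[Ib], lam)
    gap = x_t @ (theta_a - theta_b)                  # <X_t, theta_hat_a - theta_hat_b>
    if gap != 0.0:
        return 'a' if gap > 0 else 'b'
    if len(ya) != len(yb):                           # tie: prefer the less-sampled arm
        return 'a' if len(ya) < len(yb) else 'b'
    return 'a' if rng.random() < 0.5 else 'b'        # residual tie: fair coin B(0.5)
```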

The parameter $\hat\theta_{\star_t,t-1}=\hat\theta_\lambda\big(S_{\star_t,t-1}(\mathcal{I}^{\star_t}_{t-1})\big)$ involves the samples $S_{\star_t,t-1}$ and the sub-sampling index set $\mathcal{I}^{\star_t}_{t-1}$. For all deterministic $x$ and constant $\delta\in(0,1)$, and for $S=S_{\star_t,t-1}$, it holds by the proof of Theorem 2 from Abbasi-Yadkori et al. (2011) that with probability higher than $1-\delta$ (w.r.t. $S$),

$$
\big|\langle x,\hat\theta_\lambda(S)-\theta_{\star_t}\rangle\big| \;\le\; \|x\|_{V_\lambda(S)^{-1}}\,B_{\lambda,\star_t}(S)
\quad\text{where}\quad
B_{\lambda,\star_t}(S)=R_\eta\sqrt{2\log\Big(\frac{\det(V_\lambda(S))^{1/2}}{\lambda^{d/2}\,\delta}\Big)}+\lambda\,\|\theta_{\star_t}\|_{V_\lambda(S)^{-1}}\,,
$$

where $R_\eta$ comes from the sub-Gaussian assumption on the noise (7.2), and $V_\lambda(S)=X(S)^\top X(S)+\lambda I_d$. Since $\mathcal{I}^{\star_t}_{t-1}$ is chosen independently of $S_{\star_t,t-1}$, it is not difficult to see that the same bound holds for $S_{\star_t,t-1}(\mathcal{I}^{\star_t}_{t-1})$, with respect to all sources of randomness. Thus, combining this result with the decomposition (7.9), and using the assumption that $X_t$ is independent from $\bigcup_{a\in\mathcal{A}}S_{a,t-1}$, one deduces that with probability higher than $1-2\delta$,

$$
\Delta_t\,\mathbb{I}\{I_t=\neg\star_t\} \;\le\; \Big[\sum_{a'\in\{\star_t,\neg\star_t\}}\|X_t\|_{V_{a',t-1}^{-1}}\,B_{a',t-1}\Big]\,\mathbb{I}\{I_t=\neg\star_t\}\,, \tag{7.10}
$$

where the short-hand notations $V_{a,t-1}\overset{\text{def}}{=}V_{\lambda_{t-1}}\big(S_{a,t-1}(\mathcal{I}^a_{t-1})\big)$ as well as $B_{a,t-1}\overset{\text{def}}{=}B_{\lambda_{t-1},a}\big(S_{a,t-1}(\mathcal{I}^a_{t-1})\big)$ are introduced for convenience.
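Concretely, the quantities $V_\lambda(S)$ and $B_{\lambda,\theta}(S)$ above can be computed as follows. This is a sketch under the stated assumptions (the helper name is ours, and $R_\eta$ is taken as a known sub-Gaussian constant):

```python
import numpy as np

def confidence_width(X, lam, delta, R_eta, theta):
    """B_{lam,theta}(S) from the Abbasi-Yadkori-style bound of Step 1 (sketch).
    X: n x d matrix of the (sub-sampled) contexts in S; theta: the arm parameter."""
    n, d = X.shape
    V = X.T @ X + lam * np.eye(d)          # V_lambda(S) = X(S)^T X(S) + lam I_d
    _, logdet = np.linalg.slogdet(V)
    # R_eta * sqrt( 2 * log( det(V)^(1/2) / (lam^(d/2) * delta) ) )
    term1 = R_eta * np.sqrt(2 * (0.5 * logdet - 0.5 * d * np.log(lam) - np.log(delta)))
    # lam * ||theta||_{V^{-1}}
    term2 = lam * np.sqrt(theta @ np.linalg.solve(V, theta))
    return term1 + term2
```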

Step 2: Now, $\|X_t\|_{V_{a,t-1}^{-1}}$ appearing in Equation (7.10) is bounded. By definition of $V_{a,t-1}$, it holds that

$$
\|X_t\|^2_{V_{a,t-1}^{-1}} \;=\; X_t^\top\Big(\lambda_{t-1}I_d+\sum_{i\in\mathcal{I}^a_{t-1}}X_{s_i}X_{s_i}^\top\Big)^{-1}X_t\,.
$$

This expression is decomposed by using the definition $X_s=\mu+\xi_s$ for all $s$. Thus, on the one hand,

$$
\|X_t\|_{V_{a,t-1}^{-1}} \;\le\; \|\mu\|_{V_{a,t-1}^{-1}} \;+\; \|\xi_t\|_2\,\lambda_{t-1}^{-1/2}\,, \tag{7.11}
$$

where one used the fact that the minimum eigenvalue of $V_{a,t-1}$ is lower-bounded by $\lambda_{t-1}$. On the other hand, the following decomposition holds:

$$
V_{a,t-1} \;=\; \lambda_{t-1}I_d + |\mathcal{I}^a_{t-1}|\,\mu\mu^\top + \sum_{i\in\mathcal{I}^a_{t-1}}\xi_{s_i}\xi_{s_i}^\top + \Big(\sum_{i\in\mathcal{I}^a_{t-1}}\xi_{s_i}\Big)\mu^\top + \mu\Big(\sum_{i\in\mathcal{I}^a_{t-1}}\xi_{s_i}^\top\Big) \;=\; V+E+E_1+E_2\,,
$$

where the following four matrices are introduced:

$$
V=\lambda_{t-1}I_d+|\mathcal{I}^a_{t-1}|\,\mu\mu^\top,\qquad
E=\sum_{i\in\mathcal{I}^a_{t-1}}\xi_{s_i}\xi_{s_i}^\top,\qquad
E_1=\mu\Big(\sum_{i\in\mathcal{I}^a_{t-1}}\xi_{s_i}^\top\Big)
\qquad\text{and}\qquad
E_2=\Big(\sum_{i\in\mathcal{I}^a_{t-1}}\xi_{s_i}\Big)\mu^\top.
$$

Now, $\mu$ is an eigenvector of the matrix $V$ with associated eigenvalue $\lambda_\mu\overset{\text{def}}{=}\lambda_{t-1}+|\mathcal{I}^a_{t-1}|\,\|\mu\|_2^2$. Thus, it holds that $\mu^\top V^{-1}\mu=\mu^\top V^{-1}\frac{V\mu}{\lambda_\mu}=\frac{\|\mu\|_2^2}{\lambda_\mu}$. The minimum eigenvalue of $E$ is non-negative. $\mu$ is also an eigenvector of the rank-1 matrix $E_1$, with eigenvalue $\lambda_{\mu,2}=\sum_{i\in\mathcal{I}^a_{t-1}}\langle\xi_{s_i},\mu\rangle$. Finally, the only non-zero eigenvalue of the rank-1 matrix $E_2$ is $\sum_{i\in\mathcal{I}^a_{t-1}}\langle\xi_{s_i},\mu\rangle$ (associated with the vector $\sum_{i\in\mathcal{I}^a_{t-1}}\xi_{s_i}$).

Thus one deduces that the matrix norm of the vector $\mu$ cannot increase too much when $V$ is perturbed by $E+E_1+E_2$: $\lambda_\mu$ is shifted by at most $\lambda_{\mu,2}+\min\{\lambda_{\mu,2},0\}$, which leads to the bound

$$
\|\mu\|^2_{V_{a,t-1}^{-1}} \;\le\; \frac{\|\mu\|_2^2}{\lambda_{t-1}+|\mathcal{I}^a_{t-1}|\,\|\mu\|_2^2+2\min\big\{\sum_{i\in\mathcal{I}^a_{t-1}}\langle\xi_{s_i},\mu\rangle,\,0\big\}}\,,
$$

under the condition that $\lambda_{t-1}+|\mathcal{I}^a_{t-1}|\,\|\mu\|_2^2+2\min\big\{\sum_{i\in\mathcal{I}^a_{t-1}}\langle\xi_{s_i},\mu\rangle,\,0\big\}>0$.

This condition happens with high probability, provided that the noise is small enough.

Indeed, by the Chernoff method together with (7.3), it holds for all deterministic set $\mathcal{I}$ of size $n$, and for $\delta\in(0,1)$, that

$$
\mathbb{P}\Big[\sum_{i\in\mathcal{I}}\langle\xi_{s_i},\mu\rangle \le -\|\mu\|_2\,\sigma_X\sqrt{2n\log(1/\delta)}\Big] \;\le\; \delta\,.
$$

Thus, since $\mathcal{I}^a_{t-1}$ is chosen independently of the samples, by a union bound over the possible values of the random size $|\mathcal{I}^a_{t-1}|\le t-1$ of the index set, it comes that on an event of probability higher than $1-\delta$,

$$
\sum_{i\in\mathcal{I}^a_{t-1}}\langle\xi_{s_i},\mu\rangle \;\ge\; -\|\mu\|_2\,\sigma_X\sqrt{2\,|\mathcal{I}^a_{t-1}|\log\big((t-1)/\delta\big)}\,.
$$

Thus, solving the condition $n\|\mu\|_2^2+\lambda_{t-1}-2\|\mu\|_2\sigma_X\sqrt{2n\log((t-1)/\delta)}>0$ in $n$, one observes that when $\lambda_{t-1}>2\sigma_X^2\log((t-1)/\delta)$, the condition is satisfied for all $n$.

Step 3: Plugging this result and the bound on $\|\mu\|^2_{V_{a,t-1}^{-1}}$ into (7.11), and combining this together with (7.10), one deduces that at a time $t$ such that $I_t=\neg\star_t$, with probability higher than $1-6\delta$,

$$
\Delta_t \;\le\; \sqrt{\frac{\|\mu\|_2^2\,B^2_{\neg\star_t,t-1}}{\lambda_{t-1}+n_{t-1}\|\mu\|_2^2-2\|\mu\|_2\sigma_X\sqrt{2n_{t-1}\log(\frac{t-1}{\delta})}}}
\;+\; \sqrt{\frac{\|\mu\|_2^2\,B^2_{\star_t,t-1}}{\lambda_{t-1}+n_{t-1}\|\mu\|_2^2-2\|\mu\|_2\sigma_X\sqrt{2n_{t-1}\log(\frac{t-1}{\delta})}}}
\;+\; \|\xi_t\|_2\,\lambda_{t-1}^{-1/2}\,\big(B_{\neg\star_t,t-1}+B_{\star_t,t-1}\big)\,, \tag{7.12}
$$

where $n_{t-1}\overset{\text{def}}{=}|\mathcal{I}^{\neg\star_t}_{t-1}|=|\mathcal{I}^{\star_t}_{t-1}|=\min\{N_{t-1}(\neg\star_t),N_{t-1}(\star_t)\}$.

The next step is to simplify this expression by upper-bounding both $B_{\neg\star_t,t-1}$ and $B_{\star_t,t-1}$. To this end, one notes that on the one hand $\lambda_{t-1}\|\theta_{\star_t}\|_{V^{-1}_{\neg\star_t,t-1}}\le\lambda_{t-1}^{1/2}\|\theta_{\star_t}\|_2$, and on the other hand, using the fact that $\|X_t\|_2\le C$ for all context vectors $X_t$, where $C\overset{\text{def}}{=}\|\mu\|_2+\sigma_X$,

$$
\det(V_{\neg\star_t,t-1}) \;\le\; \Big(\frac{\operatorname{trace}(V_{\neg\star_t,t-1})}{d}\Big)^d \;\le\; \Big(\frac{\lambda_{t-1}d+n_{t-1}C^2}{d}\Big)^d \;\le\; \big(\lambda_{t-1}+(t-1)C^2/d\big)^d\,.
$$

Thus, it holds for both $I'=\neg\star_t$ and $I'=\star_t$ that

$$
B_{I',t-1} \;\le\; R_\eta\sqrt{2d\log\Big(\frac{\lambda_{t-1}^{1/2}+(t-1)C^2/(d\,\lambda_{t-1}^{1/2})}{\delta}\Big)} \;+\; \lambda_{t-1}^{1/2}\,\|\theta_{I'}\|_2\,. \tag{7.13}
$$

For convenience, the first term on the right-hand side of (7.13) is denoted by $b_{t-1}=R_\eta\sqrt{2d\log\Big(\frac{\lambda_{t-1}^{1/2}+(t-1)C^2/(d\,\lambda_{t-1}^{1/2})}{\delta}\Big)}$.

Combining (7.13) together with (7.12), and using the fact that $\|\theta\|_2\le B$ for all $\theta\in\Theta$, it is shown that for all $t$ such that $I_t=\neg\star_t$, with probability higher than $1-6\delta$,

$$
\Delta_t \;\le\; \Bigg[\frac{2\|\mu\|_2}{\sqrt{\lambda_{t-1}+n_{t-1}\|\mu\|_2^2-2\|\mu\|_2\sigma_X\sqrt{2n_{t-1}\log(\frac{t-1}{\delta})}}} \;+\; \|\xi_t\|_2\,\lambda_{t-1}^{-1/2}\Bigg]\Big[b_{t-1}+\lambda_{t-1}^{1/2}B\Big]\,,
$$

that is, after reorganizing the terms, and provided that the noise is not too large, i.e. $\Delta_t\ge\|\xi_t\|_2\big(b_{t-1}\lambda_{t-1}^{-1/2}+B\big)$, then

$$
n_{t-1} \;-\; \frac{2\sigma_X}{\|\mu\|_2}\sqrt{2n_{t-1}\log\Big(\frac{t-1}{\delta}\Big)} \;\le\; 4\Bigg(\frac{b_{t-1}+\lambda_{t-1}^{1/2}B}{\Delta_t-\|\xi_t\|_2\big(b_{t-1}\lambda_{t-1}^{-1/2}+B\big)}\Bigg)^2 \;-\; \frac{\lambda_{t-1}}{\|\mu\|_2^2}\,. \tag{7.14}
$$

By introducing the threshold $\tau_t=\|\xi_t\|_2\big(b_{t-1}\lambda_{t-1}^{-1/2}+B\big)$, the regret is decomposed into

$$
R_{X,T} \;\le\; \sum_{t=1}^T \Delta_t\,\mathbb{I}\{I_t=\neg\star_t,\ \Delta_t\ge\tau_t\} \;+\; \sum_{t=1}^T \Delta_t\,\mathbb{I}\{\Delta_t<\tau_t\}\,.
$$

Step 4: In this step, two cases are considered: first, the case when $N_{t-1}(\neg\star_t)\ge N_{t-1}(\star_t)$; second, the case when $N_{t-1}(\neg\star_t)<N_{t-1}(\star_t)$.

In the first situation, $n_{t-1}=N_{t-1}(\star_t)$, and it comes from (7.14) that $N_{t-1}(\star_t)$ cannot be too large. Indeed, for positive constants $A,B$, the condition $n-A\sqrt{n}\le B$ implies that $n\le B+\frac{A^2}{2}\big(1+\sqrt{1+4B/A^2}\big)$, which in this case leads to $N_{t-1}(\star_t)\le u_{\lambda,t}(\delta)$, where

$$
u_{\lambda,t}(\delta) \;\overset{\text{def}}{=}\; 4\Bigg(\frac{b_{t-1}+\lambda_{t-1}^{1/2}B}{\Delta_t-\|\xi_t\|_2\big(b_{t-1}\lambda_{t-1}^{-1/2}+B\big)}\Bigg)^2
\;+\; \frac{4\sigma_X^2\log(\frac{t-1}{\delta})}{\|\mu\|_2^2}
\;+\; \frac{4\sigma_X\sqrt{2\log(\frac{t-1}{\delta})}}{\|\mu\|_2}\,\Bigg(\frac{b_{t-1}+\lambda_{t-1}^{1/2}B}{\Delta_t-\|\xi_t\|_2\big(b_{t-1}\lambda_{t-1}^{-1/2}+B\big)}\Bigg)
\;-\; \frac{\lambda_{t-1}}{\|\mu\|_2^2}\,.
$$
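The elementary inequality invoked in this step, namely that $n-A\sqrt{n}\le B$ forces $n\le B+\frac{A^2}{2}\big(1+\sqrt{1+4B/A^2}\big)$, can be verified with a quick randomized check (illustrative only, our own snippet):

```python
import numpy as np

# If n - A*sqrt(n) <= B for A, B > 0, then n <= B + (A^2/2)*(1 + sqrt(1 + 4B/A^2)).
rng = np.random.default_rng(1)
for _ in range(100000):
    A, B = rng.uniform(0.1, 10, size=2)
    bound = B + (A**2 / 2) * (1 + np.sqrt(1 + 4 * B / A**2))
    n = rng.uniform(0, 4 * bound)
    if n - A * np.sqrt(n) <= B:
        assert n <= bound + 1e-9
print("inequality verified on 100000 random instances")
```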

Likewise, in the second case when $N_{t-1}(\neg\star_t)<N_{t-1}(\star_t)$, one deduces that necessarily $N_{t-1}(\neg\star_t)\le u_{\lambda,t}(\delta)$ with high probability, and thus a controlled regret.

From this point on, one can proceed similarly to the elementary proofs of the regret of the UCB algorithm (Auer et al., 2002), as in (Maillard et al., 2011; Bubeck, 2010), or of BESA (Baransi et al., 2014). More precisely, since $|\Delta_t|\le1$ by assumption, it holds that

$$
\begin{aligned}
R_{X,T} &\le \sum_{t=1}^T \Delta_t\,\mathbb{I}\big\{I_t=\neg\star_t\cap\Delta_t\ge\tau_t\cap N_{t-1}(\star_t)>u_{\lambda,t}(\delta_t)\big\} + \sum_{t=1}^T \mathbb{I}\big\{N_{t-1}(\star_t)\le u_{\lambda,t}(\delta_t)\big\} + \sum_{t=1}^T \Delta_t\,\mathbb{I}\{\Delta_t<\tau_t\}\\
&\le \sum_{t=1}^T \Delta_t\,\mathbb{I}\big\{I_t=\neg\star_t\cap\Delta_t\ge\tau_t\cap N_{t-1}(\neg\star_t)\le u_{\lambda,t}(\delta_t)\big\} + 6\sum_{t=1}^T \delta_t + \sum_{t=1}^T \mathbb{I}\big\{N_{t-1}(\star_t)\le u_{\lambda,t}(\delta_t)\big\} + \sum_{t=1}^T \Delta_t\,\mathbb{I}\{\Delta_t<\tau_t\}\,,
\end{aligned}
$$

for any choice of $\delta_t\in(0,1)$ for $t\in\{1,\dots,T\}$. In particular, this holds for the choice $\delta_t=t^{-2}$.

The first sum is split into the sum over the time steps for which $\star_t=a$ and the sum over the time steps for which $\star_t=b$. Note that $\Delta_t$ is not independent from $I_t$. For the sum such that $\star_t=b$, it comes

$$
\begin{aligned}
\sum_{t=1}^T \Delta_t\,\mathbb{I}\big\{I_t=a\cap N_{t-1}(a)\le u_{\lambda,t}(\delta_t)\big\}\,\mathbb{I}\{\star_t=b\}
&\le \big(\max_{t\le T}\Delta_t\big)\sum_{t=1}^T \mathbb{I}\big\{I_t=a\cap N_{t-1}(a)\le\max_{t\le T}u_{\lambda,t}(\delta_t)\big\}\,\mathbb{I}\{\star_t=b\}\\
&\le \big(\max_{t\le T}\Delta_t\big)\big(\max_{t\le T}u_{\lambda,t}(\delta_t)\big)\,.
\end{aligned}
$$

Likewise, a similar control can be obtained for the sum corresponding to $\star_t=a$.

In order to control the maximum terms, one uses the fact that $\|\xi_t\|_2^2\le\sigma_X^2$. Thus, it holds that $\max_{t\le T}u_{\lambda,t}(\delta_t)\le u$, where

$$
u \;\overset{\text{def}}{=}\; 16\Bigg[\frac{R_\eta\sqrt{2d\log\Big(\lambda_T^{1/2}T^2+\frac{T^3C^2}{d\,\lambda_T^{1/2}}\Big)}+\lambda_T^{1/2}B}{\min_{t\in[T]}\Delta_t} \;+\; \frac{\sigma_X\sqrt{6\log(T)}}{\|\mu\|_2}\Bigg]^2 \;-\; \frac{\lambda_T}{\|\mu\|_2^2}\,,
$$

provided that the context-noise is small enough that

$$
\min_{t\in[T]}\Delta_t \;>\; \tau \;\overset{\text{def}}{=}\; 2\sigma_X\Bigg[R_\eta\sqrt{\frac{2d}{\lambda_T}\log\Big(\lambda_T^{1/2}T^2+\frac{T^3C^2}{d\,\lambda_T^{1/2}}\Big)}+B\Bigg]\,.
$$
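In code form, the two constants of this step read as follows (a sketch with our own variable names; `gap_min` stands for $\min_{t\in[T]}\Delta_t$ and `mu_norm` for $\|\mu\|_2$):

```python
import numpy as np

def u_and_tau(T, d, lam_T, C, B, R_eta, sigma_X, gap_min, mu_norm):
    """The constants u and tau of Step 4 (sketch of the reconstructed formulas)."""
    log_term = np.log(lam_T**0.5 * T**2 + T**3 * C**2 / (d * lam_T**0.5))
    b_T = R_eta * np.sqrt(2 * d * log_term)
    u = 16 * ((b_T + lam_T**0.5 * B) / gap_min
              + sigma_X * np.sqrt(6 * np.log(T)) / mu_norm)**2 - lam_T / mu_norm**2
    tau = 2 * sigma_X * (R_eta * np.sqrt(2 * d / lam_T * log_term) + B)
    return u, tau
```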

To sum up this step, it has been shown so far that $R_{X,T}$ is bounded as

$$
R_{X,T} \;\le\; 2\big(\max_{t\le T}\Delta_t\big)\,u \;+\; \pi^2 \;+\; \sum_{t=1}^T \mathbb{I}\big\{N_{t-1}(\star_t)\le u_{\lambda,t}(t^{-2})\big\} \;+\; \sum_{t=1}^T \Delta_t\,\mathbb{I}\Big\{\min_{t\in[T]}\Delta_t\le\tau\Big\}\,.
$$

Step 5: In order to conclude this proof, the term that now needs to be controlled is

$$
\sum_{t=1}^T \mathbb{I}\big\{N_{t-1}(\star_t)\le u_{\lambda,t}(\delta_t)\big\} \;=\; \sum_{t\in\mathcal{T}_a}\mathbb{I}\big\{N_{t-1}(a)\le u_{\lambda,t}(\delta_t)\big\} \;+\; \sum_{t\in\mathcal{T}_b}\mathbb{I}\big\{N_{t-1}(b)\le u_{\lambda,t}(\delta_t)\big\}\,,
$$

where $\mathcal{T}_a\overset{\text{def}}{=}\{t\in[T]:\star_t=a\}$ for $a\in\mathcal{A}$.

To this end, a procedure similar to that used in (Baransi et al., 2014) is employed. More precisely, following the exact same steps as Steps 3 and 4 of the proof of Theorem 1 in (Baransi et al., 2014), it comes that

$$
\sum_{t\in\mathcal{T}_b}\mathbb{P}\big[N_{t-1}(b)\le u_{\lambda,t}(\delta_t)\big] \;\le\; c \;+\; \sum_{t\in\mathcal{T}_b,\,t\ge c}\ \sum_{j=1}^{\lfloor u_{\lambda,t}(\delta_t)\rfloor}\alpha_{b,a}(M_t,j) \;+\; O(1)\,,
$$

where $c$ is a constant such that $t\ge c$ implies $t\ge u_{\lambda,t}(\delta_t)\big(u_{\lambda,t}(\delta_t)+1\big)$, $M_t\in\mathbb{N}$ is such that $M_t=O(\log(t))$, and where the function $\alpha_{b,a}(M,j)$ is defined by¹

$$
\alpha_{b,a}(M,j) \;=\; \mathbb{E}_{Z^b_{1:j}}\Big[\mathbb{P}_{Z^a_{1:j},X}\Big(\big\langle X,\hat\theta(Z^a_{1:j})-\hat\theta(Z^b_{1:j})\big\rangle>0\Big)^M\Big]\,.
$$

Here, one used explicitly the stochastic nature of the context $X_t$, to avoid having to deal with much more complex expressions. This comes at the price of restricting to cases when the noise is not too strong. Here, $Z^b_{1:j}$ denotes a set of $j$ i.i.d. samples $Z^b_j=(X_j,Y_j)$ generated from the model considered in the introduction when arm $b$ is chosen (that is, such that $Y_j=\langle X_j,\theta_b\rangle+\eta_j$ and $X_j=\mu+\xi_j$), $Z^a_{1:j}$ denotes a similar set built using $\theta_a$ instead of $\theta_b$, and $X=\mu+\xi$ is generated by Nature.

¹Indeed, the tie event $\langle X,\hat\theta(Z^a_{1:j})-\hat\theta(Z^b_{1:j})\rangle=0$ has probability 0, since the distributions are diffuse.
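Since $\alpha_{b,a}(M,j)$ is the expectation of the $M$-th power of a probability, it can be estimated by nested Monte Carlo. The sketch below mirrors the generative model just described, using Gaussian noise as one concrete sub-Gaussian instance; all names are ours, not from the original text:

```python
import numpy as np

def alpha_estimate(theta_a, theta_b, mu, sigma_X, sigma_eta, j, M, lam,
                   rng, n_outer=100, n_inner=200):
    """Monte Carlo estimate (sketch) of
    alpha_{b,a}(M, j) = E_{Z^b}[ P_{Z^a, X}( <X, th_hat(Z^a) - th_hat(Z^b)> > 0 )^M ]."""
    d = len(mu)
    def sample_theta_hat(theta):
        X = mu + rng.normal(scale=sigma_X, size=(j, d))        # X_s = mu + xi_s
        y = X @ theta + rng.normal(scale=sigma_eta, size=j)    # Y_s = <X_s, theta> + eta_s
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    vals = []
    for _ in range(n_outer):                                   # outer expectation over Z^b_{1:j}
        th_b = sample_theta_hat(theta_b)
        wins = 0
        for _ in range(n_inner):                               # inner probability over (Z^a_{1:j}, X)
            th_a = sample_theta_hat(theta_a)
            x = mu + rng.normal(scale=sigma_X, size=d)         # fresh context from Nature
            wins += (x @ (th_a - th_b) > 0)
        vals.append((wins / n_inner) ** M)
    return float(np.mean(vals))
```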

Step 6: The next step of the proof is to control the quantity $\alpha_{b,a}(M,j)$ for steps $\mathcal{T}_b$ (and likewise $\alpha_{a,b}(M,j)$ for steps $\mathcal{T}_a$). In the sequel, the notation $b=\star$ is used to clarify that $b$ is the optimal arm in time steps $\mathcal{T}_b$. One wants to show that this quantity decays exponentially fast to 0 with either $M$ or $j$, so that its contribution to the regret is controlled. To begin with, the dot product is decomposed according to the different random variables:

$$
\big\langle X,\hat\theta(Z^a_{1:j})-\hat\theta(Z^\star_{1:j})\big\rangle \;=\; \langle\mu,\theta_\star-\hat\theta(Z^\star_{1:j})\rangle + \langle\mu,\hat\theta(Z^a_{1:j})-\theta_a\rangle + \langle\xi,\hat\theta(Z^a_{1:j})-\theta_a\rangle + \langle\xi,\theta_\star-\hat\theta(Z^\star_{1:j})\rangle + \langle\xi,\theta_a-\theta_\star\rangle - \Delta\,,
$$

where $\Delta=\langle\mu,\theta_\star-\theta_a\rangle$. Then, it holds for all $\varepsilon>0$ that

$$
\mathbb{P}_X\big(\langle\xi,\theta_a-\theta_\star\rangle\ge\varepsilon\big) \;\le\; \exp\Big(-\frac{\varepsilon^2}{2\|\theta_a-\theta_\star\|_2^2\,\sigma_X^2}\Big)\,. \tag{7.15}
$$

The term $\langle\xi,\theta_\star-\hat\theta(Z^\star_{1:j})\rangle$ is controlled a bit differently, by

$$
\mathbb{P}_X\Big(\langle\xi,\theta_\star-\hat\theta(Z^\star_{1:j})\rangle\ge\varepsilon\Big) \;\le\; \exp\Big(-\frac{\varepsilon^2\lambda_{t_j}}{2\,\|\theta_\star-\hat\theta(Z^\star_{1:j})\|^2_{V_{\star,t_j}}\sigma_X^2}\Big)\,, \tag{7.16}
$$

where $t_j\ge j$ corresponds to a time when arm $a$ is sampled at least $j$ times (note that this is a probability with respect to $X$, not $Z^\star_{1:j}$).

Likewise, $\langle\xi,\hat\theta(Z^a_{1:j})-\theta_a\rangle$ is controlled by

$$
\mathbb{P}_{Z^a_{1:j},X}\Big(\langle\xi,\hat\theta(Z^a_{1:j})-\theta_a\rangle\ge\varepsilon\Big) \;\le\; \Big(\lambda_{t_j}^{1/2}+\frac{jC^2}{d\,\lambda_{t_j}^{1/2}}\Big)\exp\Big(-\frac{\lambda_{t_j}\big(\varepsilon-\sigma_X\|\theta_\star\|_2\big)^2}{2\sigma_X^2R_\eta^2}\Big)\,, \tag{7.17}
$$

for all $\varepsilon>\sigma_X\|\theta_\star\|_2$.

Finally, it has already been shown that, with probability higher than $1-\delta-\delta'$ with respect to $Z^a_{1:j}$,

$$
\langle\mu,\theta_a-\hat\theta(Z^a_{1:j})\rangle \;\le\; \frac{\|\mu\|_2\,R_\eta\sqrt{2\log\Big(\frac{\lambda_{t_j}^{1/2}+jC^2/(d\,\lambda_{t_j}^{1/2})}{\delta}\Big)}+\lambda_{t_j}^{1/2}\|\mu\|_2\|\theta_a\|_2}{\sqrt{\lambda_{t_j}+j\|\mu\|_2^2-2\|\mu\|_2\sigma_X\sqrt{2j\log(1/\delta')}}}\,. \tag{7.18}
$$

Inverting this bound in $\delta$, this gives, for all $\varepsilon\ge\frac{\|\mu\|_2}{\sqrt{j\|\mu\|_2^2+\lambda_{t_j}}}$ and $\varepsilon'\ge\lambda_{t_j}^{1/2}\|\theta_a\|_2$, that

$$
\mathbb{P}_{Z^a_{1:j}}\Big(\langle\mu,\theta_a-\hat\theta(Z^a_{1:j})\rangle\ge\varepsilon\varepsilon'\Big) \;\le\; \exp\Big(-\frac{\big(j\|\mu\|_2^2+\lambda_{t_j}-\|\mu\|_2^2/\varepsilon^2\big)^2}{8\|\mu\|_2^2\sigma_X^2\,j}\Big) \;+\; \Big(\lambda_{t_j}^{1/2}+\frac{jC^2}{d\,\lambda_{t_j}^{1/2}}\Big)\exp\Big(-\frac{\big(\varepsilon'-\lambda_{t_j}^{1/2}\|\theta_a\|_2\big)^2}{2R_\eta^2}\Big)\,. \tag{7.19}
$$

Thus, by combining equations (7.15), (7.16), (7.17) and (7.19) together, it comes

$$
\begin{aligned}
\mathbb{P}_{Z^a_{1:j},X}\Big(\big\langle X,\hat\theta(Z^a_{1:j})-\hat\theta(Z^\star_{1:j})\big\rangle>0\Big)
&\le \mathbb{P}_{Z^a_{1:j},X}\Big(\langle\mu,\hat\theta(Z^a_{1:j})-\theta_a\rangle+\langle\xi,\hat\theta(Z^a_{1:j})-\theta_a\rangle\\
&\qquad +\langle\xi,\theta_\star-\hat\theta(Z^\star_{1:j})\rangle+\langle\xi,\theta_a-\theta_\star\rangle \;>\; \Delta-\langle\mu,\theta_\star-\hat\theta(Z^\star_{1:j})\rangle\Big)\\
&\le \inf_{\varepsilon_0,\dots,\varepsilon_4}\Bigg\{\exp\Big(-\frac{\big(j\|\mu\|_2^2+\lambda_{t_j}-\|\mu\|_2^2/\varepsilon_0^2\big)^2}{8\|\mu\|_2^2\sigma_X^2\,j}\Big)\\
&\qquad +\Big(\lambda_{t_j}^{1/2}+\frac{jC^2}{d\,\lambda_{t_j}^{1/2}}\Big)\exp\Big(-\frac{\big(\varepsilon_1-\lambda_{t_j}^{1/2}\|\theta_a\|_2\big)^2}{2R_\eta^2}\Big)\\
&\qquad +\Big(\lambda_{t_j}^{1/2}+\frac{jC^2}{d\,\lambda_{t_j}^{1/2}}\Big)\exp\Big(-\frac{\lambda_{t_j}\big(\varepsilon_2-\sigma_X\|\theta_\star\|_2\big)^2}{2\sigma_X^2R_\eta^2}\Big)\\
&\qquad +\exp\Big(-\frac{\varepsilon_3^2\lambda_{t_j}}{2\,\|\theta_\star-\hat\theta(Z^\star_{1:j})\|^2_{V_{\star,t_j}}\sigma_X^2}\Big)
+\exp\Big(-\frac{\varepsilon_4^2}{2\|\theta_a-\theta_\star\|_2^2\sigma_X^2}\Big)\;:\\
&\qquad \varepsilon_0\varepsilon_1+\varepsilon_2+\varepsilon_3+\varepsilon_4\le\Delta-\langle\mu,\theta_\star-\hat\theta(Z^\star_{1:j})\rangle,\quad \varepsilon_2>\sigma_X\|\theta_\star\|_2,\\
&\qquad \varepsilon_0\ge\frac{\|\mu\|_2}{\sqrt{j\|\mu\|_2^2+\lambda_{t_j}}},\quad \varepsilon_1\ge\lambda_{t_j}^{1/2}\|\theta_a\|_2\Bigg\}\,.
\end{aligned}
$$

By choosing $\varepsilon_0=\frac{\sqrt{2}\,\|\mu\|_2}{\sqrt{j\|\mu\|_2^2+\lambda_{t_j}}}$, $\varepsilon_1=2\lambda_{t_j}^{1/2}\|\theta_a\|_2$, $\varepsilon_2=2\sigma_X\|\theta_\star\|_2$, $\varepsilon_3=\sqrt{2}\,\sigma_X\|\theta_\star\|_2$, and $\varepsilon_4=\sqrt{2}\,\sigma_X\|\theta_a-\theta_\star\|_2\,\kappa$ for some positive $\kappa$, the previous expression can be simplified. This way, and using the fact that $\lambda_T\ge\lambda_{t_j}\ge\lambda_j$, one finally establishes the following bound

on the quantity $\alpha_{\star,a}(M_t,j)$ that has to be controlled:

$$
\alpha_{\star,a}(M_t,j) \;\le\; \mathbb{E}_{Z^\star_{1:j}}\Bigg[\bigg(e^{-\kappa^2}
+e^{-\frac{(j\|\mu\|_2^2+\lambda_j)^2}{32\|\mu\|_2^2\sigma_X^2\,j}}
+\Big(\lambda_T^{1/2}+\frac{jC^2}{d\,\lambda_j^{1/2}}\Big)\Big[e^{-\frac{\lambda_j\|\theta_a\|_2^2}{2R_\eta^2}}+e^{-\frac{\lambda_j\|\theta_\star\|_2^2}{2R_\eta^2}}\Big]
+e^{-\frac{\lambda_j\|\theta_\star\|_2^2}{\|\theta_\star-\hat\theta(Z^\star_{1:j})\|^2_{V_{\star,t_j}}}}\bigg)^{M_t}\mathbb{I}\{\mathcal{E}\}+\mathbb{I}\{\mathcal{E}^c\}\Bigg]\,. \tag{7.20}
$$

For convenience, the event $\mathcal{E}$ has been introduced:

$$
\mathcal{E} \;\overset{\text{def}}{=}\; \bigg\{\langle\mu,\theta_\star-\hat\theta(Z^\star_{1:j})\rangle \;\le\; \Delta \;-\; \frac{2\sqrt{2}\,\|\mu\|_2\|\theta_a\|_2}{\sqrt{j\|\mu\|_2^2/\lambda_{t_j}+1}} \;-\; \sqrt{2}\,\sigma_X B\big(1+\sqrt{2}+2\kappa\big)\bigg\}\,.
$$

At this point, one notes that since $\lambda_j\ge6\sigma_X^2\log(j)$, all the exponential terms but $e^{-\kappa^2}$ decay polynomially fast to 0 with $j$, and that for $M_t=O(\log(t))$, the first half of the bound on $\alpha_{\star,a}(M_t,j)$ decays polynomially fast to 0 with $t$, at a rate that can always be adjusted to be $t^{-(1+\beta)}$ for some small $\beta>0$. Thus, in order to control the regret term of Step 5, one deduces from this observation and (7.20) that it only remains to show that $\mathbb{P}_{Z^\star_{1:j}}(\mathcal{E}^c)$ is small enough under our assumptions on the noise. From equation (7.19), it is not difficult to see that

$$
\mathbb{P}_{Z^\star_{1:j}}(\mathcal{E}^c) \;\le\; e^{-\frac{(j\|\mu\|_2^2+\lambda_j)^2}{32\|\mu\|_2^2\sigma_X^2\,j}} \;+\; \Big(\lambda_T^{1/2}+\frac{jC^2}{d\,\lambda_j^{1/2}}\Big)\,e^{-\frac{\lambda_j\|\theta_\star\|_2^2}{2R_\eta^2}}\,,
$$

provided that the following condition holds:

$$
\frac{2\sqrt{2}\,\|\mu\|_2\big(\|\theta_a\|_2+\|\theta_\star\|_2\big)}{\sqrt{j\|\mu\|_2^2/\lambda_T+1}} \;\le\; \Delta \;-\; \sqrt{2}\,\sigma_X B\big(1+\sqrt{2}+2\kappa\big)\,.
$$

Finally, in order for the term $\mathbb{P}_{Z^\star_{1:j}}(\mathcal{E}^c)$ to decay fast enough with $t$ (at least as $t^{-(1+\beta)}$), it is enough to choose $\lambda_t\ge c\log(T)$ for all $t$, for some constant $c$, which leads to

$$
\sum_{t\in\mathcal{T}_b,\,t\ge c}\ \sum_{j=1}^{\lfloor u_{\lambda,t}(\delta_t)\rfloor}\alpha_{b,a}(M_t,j) \;=\; O(1)\,.
$$

Step 7: Now that $\alpha_{\star,a}(M_t,j)$ is controlled, $\alpha_{a,b}(M_t,j)$ can be treated similarly. Thus, at the price of losing a factor 2 (using $|\mathcal{T}_a|\le T$ and $|\mathcal{T}_b|\le T$), it has been shown that

$$
\mathbb{E}[R_{X,T}] \;\le\; \big(\max_{t\in[T]}\Delta_t\big)\,U \;+\; \sum_{t=1}^T \Delta_t\,\mathbb{I}\Big\{\min_{t\in[T]}\Delta_t\le\tau\Big\} \;+\; O(1)\,,
$$

where

$$
U \;\overset{\text{def}}{=}\; \frac{64}{\min_{t\in[T]}\Delta_t^2}\Bigg[R_\eta\sqrt{2d\log\Big(\lambda_T^{1/2}T^2+\frac{T^3C^2}{d\,\lambda_T^{1/2}}\Big)}+\lambda_T^{1/2}B\Bigg]^2 \;+\; \frac{24\sigma_X^2\log(T)-2\lambda_T}{\|\mu\|_2^2}
$$

and

$$
\tau \;\overset{\text{def}}{=}\; 2\sigma_X\Bigg[R_\eta\sqrt{\frac{2d}{\lambda_T}\log\Big(\lambda_T^{1/2}T^2+\frac{T^3C^2}{d\,\lambda_T^{1/2}}\Big)}+B\Bigg]\,.
$$

This concludes the proof, after some cosmetic simplifications of $U$ using $(a+b)^2\le2(a^2+b^2)$ for positive $a,b$.