7.3 Contextual regret bound
7.3.2 Theoretical bound
Then, the contextual regret of CL-BESA after $T$ rounds is upper bounded by
\[
\mathbb{E}[R_{X,T}] \le \Big(\max_{t\in[T]}\Delta_t\Big)\Bigg(\frac{64}{\min_{t\in[T]}\Delta_t^2}\bigg[R_\eta\sqrt{2d\log\Big(\frac{\lambda^{1/2}T^2+T^3(\|\mu\|_2+\sigma_X)^2}{d\lambda^{1/2}}\Big)}+\lambda^{1/2}B\bigg]^2+\frac{24\sigma_X^2\log(T)-2\lambda}{\|\mu\|_2^2}\Bigg)+\sum_{t=1}^{T}\Delta_t\,\mathbb{I}\Big\{\min_{t\in[T]}\Delta_t\le\tau\Big\}+O(1)\,,
\]
where the expectation is with respect to the internal randomness of the algorithm and of the additive reward noise, and where
\[
\tau\stackrel{\text{def}}{=}2\sigma_X\bigg[R_\eta\sqrt{\frac{2d}{\lambda}\log\Big(\frac{\lambda^{1/2}T^2+T^3(\|\mu\|_2+\sigma_X)^2}{d\lambda^{1/2}}\Big)}+B\bigg]\,. \quad (7.8)
\]
When the context perturbation is $\sigma_X=0$, then $\Delta_t$ reduces to $\Delta\stackrel{\text{def}}{=}|\langle\mu,\theta_a-\theta_b\rangle|$ and we obtain
\[
\mathbb{E}[R_{X,T}]\le\frac{128\,R_\eta^2\,d}{\Delta}\log\big(2T^3\|\mu\|_2^2/d\big)+O(1)\,.
\]
Algorithm                                        Regret bound
CL-BESA                                          $O(d\log(T^3))$
OFUL (Abbasi-Yadkori et al., 2011)               $O(d\log^2(T))$
Confidence Bound (Dani et al., 2008)             $O(d^2\log^3(T))$
Thompson sampling (Agrawal and Goyal, 2012a)     $O(d^2\sqrt{T})$

Table 7.1: Theoretical regret bounds for contextual bandit algorithms.
Remark 7.3.1. For the sake of comparison, Table 7.1 lists the theoretical regret bounds associated with state-of-the-art algorithms, showing the merits of CL-BESA: the associated regret scales linearly with the dimension $d$, and logarithmically with the time horizon $T$. This result establishes the applicability of the sub-sampling technique to the contextual multi-armed bandit problem.
Remark 7.3.2. The main restriction of the analysis concerns the minimum gap $\min_{t\in[T]}\Delta_t$ (see Eq. 7.8). Note that a similar limitation is encountered in the distribution-dependent analysis of OFUL; additionally, its authors explicitly assume a constant optimal arm in the distribution-dependent proof.
Remark 7.3.3. The assumption $\lambda\ge 6\sigma_X^2\log(T)$ is mainly formulated for technical reasons. Also, the experiments (see Section 7.4) suggest that Equations 7.7 and 7.8 might be improved further. In practice, $\lambda=\lambda_t=\Omega(\sigma_X^2)\log(t)$ is recommended, even though a good robustness with respect to the choice of $\lambda$ is observed in the experiments.
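As a concrete illustration of these tuning quantities, the sketch below computes the recommended schedule $\lambda_t=c\,\sigma_X^2\log(t)$ (with $c=6$ mirroring the assumption $\lambda\ge 6\sigma_X^2\log(T)$) and the gap threshold $\tau$ of Eq. (7.8); all numerical parameter values are illustrative assumptions, not values prescribed by the text.

```python
import math

# Hedged sketch (illustrative constants): the regularization schedule
# lambda_t = c * sigma_X**2 * log(t) from Remark 7.3.3, with c = 6 mirroring
# the assumption lambda >= 6*sigma_X^2*log(T), and the gap threshold tau of
# Eq. (7.8). All parameter values below are assumptions for illustration.
def lambda_schedule(t, sigma_x, c=6.0):
    return c * sigma_x ** 2 * math.log(max(t, 2))

def tau(T, d, lam, sigma_x, mu_norm, R_eta, B):
    C2 = (mu_norm + sigma_x) ** 2  # C^2 = (||mu||_2 + sigma_X)^2
    inner = (math.sqrt(lam) * T ** 2 + T ** 3 * C2) / (d * math.sqrt(lam))
    return 2 * sigma_x * (R_eta * math.sqrt((2 * d / lam) * math.log(inner)) + B)

T, d, sigma_x = 10_000, 5, 0.1
lam = lambda_schedule(T, sigma_x)
print(round(tau(T, d, lam, sigma_x, mu_norm=1.0, R_eta=0.5, B=1.0), 3))
```

The bound of the theorem is vacuous whenever $\min_{t\in[T]}\Delta_t$ falls below this $\tau$, which is why the condition on the context noise matters in practice.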
Proof. Step 1: Let $\star_t$ be the optimal action at time $t$ and $\lnot\star_t$ the other action (recall that there are only $K=2$ arms, $\mathcal{A}=\{a,b\}$).
By definition of the contextual regret at time $T$,
\[
R_{X,T}=\sum_{t=1}^{T}\langle X_t,\theta_{\star_t}-\theta_{I_t}\rangle
=\sum_{t=1}^{T}\langle X_t,\theta_{\star_t}-\theta_{\lnot\star_t}\rangle\,\mathbb{I}\{I_t=\lnot\star_t\}
=\sum_{t=1}^{T}\Delta_t\,\mathbb{I}\{\star_t=a,\,I_t=b\}+\sum_{t=1}^{T}\Delta_t\,\mathbb{I}\{\star_t=b,\,I_t=a\}\,,
\]
with $\Delta_t=\langle X_t,\theta_{\star_t}-\theta_{\lnot\star_t}\rangle$ the instantaneous gap.
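As a sanity check on this identity, one can simulate a two-armed contextual instance and verify that the two expressions for $R_{X,T}$ coincide. The model parameters and the uniformly random policy below are illustrative placeholders, not the CL-BESA sampler.

```python
import numpy as np

# Hedged numerical check (synthetic instance) of the Step 1 identity: with
# K = 2 arms, the contextual regret sum_t <X_t, theta_star_t - theta_It>
# equals the gap-weighted mistake count sum_t Delta_t * 1{I_t = not star_t}.
rng = np.random.default_rng(0)
d, T, sigma_x = 5, 200, 0.3
theta = {"a": rng.normal(size=d), "b": rng.normal(size=d)}
mu = rng.normal(size=d)

lhs = rhs = 0.0
for _ in range(T):
    x = mu + sigma_x * rng.normal(size=d)                   # context X_t
    star = max(("a", "b"), key=lambda arm: x @ theta[arm])  # optimal arm
    other = "b" if star == "a" else "a"
    chosen = str(rng.choice(["a", "b"]))                    # placeholder policy
    lhs += x @ (theta[star] - theta[chosen])
    delta_t = x @ (theta[star] - theta[other])              # instantaneous gap
    rhs += delta_t * (chosen != star)

print(abs(lhs - rhs))  # the two expressions coincide
```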
The event {It= ¬?t} involves〈Xt,θb?t,t−1−θb¬?t,t−1〉, thus the instantaneous gap can be decomposed as
∆t=〈Xt,θ?t−θb?t,t−1〉 + 〈Xt,θb¬?t,t−1−θ¬?t〉
+ 〈Xt,θb?t,t−1−θb¬?t,t−1〉. (7.9) Now, on the event {It= ¬?t}, either〈Xt,θb?t,t−1−θb¬?t,t−1〉 <0, or〈Xt,θb?t,t−1−θb¬?t,t−1〉 = 0 andNt−1(¬?t)<Nt−1(?t), or〈Xt,θb?t,t−1−θb¬?t,t−1〉 =0 andNt−1(¬?t)=Nt−1(?t) and a random coinξt∼B(0.5) is tossed that gets value 1 (without loss of generality).
In any case, it holds that
〈Xt,θb?t,t−1−θb¬?t,t−1〉I{It= ¬?t}É0 .
The parameter $\hat\theta_{\star_t,t-1}=\hat\theta_\lambda(S_{\star_t,t-1}(I^{\star_t}_{t-1}))$ involves the samples $S_{\star_t,t-1}$ and the sub-sampling index set $I^{\star_t}_{t-1}$. For all deterministic $x$ and constant $\delta\in(0,1)$, and for $S=S_{\star_t,t-1}$, it holds by the proof of Theorem 2 from Abbasi-Yadkori et al. (2011) that with probability higher than $1-\delta$ (w.r.t. $S$),
\[
\big|\langle x,\hat\theta_\lambda(S)-\theta_{\star_t}\rangle\big|\le\|x\|_{V_\lambda(S)^{-1}}\,B_{\lambda,\star_t}(S)
\quad\text{where}\quad
B_{\lambda,\star_t}(S)=R_\eta\sqrt{2\log\Big(\frac{\det(V_\lambda(S))}{\lambda^{d/2}\delta}\Big)}+\lambda\|\theta_{\star_t}\|_{V_\lambda(S)^{-1}}\,,
\]
where $R_\eta$ comes from the sub-Gaussian assumption on the noise (7.2), and $V_\lambda(S)=X(S)^\top X(S)+\lambda I_d$. Since $I^{\star_t}_{t-1}$ is chosen independently of $S_{\star_t,t-1}$, the same bound holds for $S_{\star_t,t-1}(I^{\star_t}_{t-1})$, with respect to all sources of randomness. Thus, combining this result together with the decomposition (7.9), and using the assumption that $X_t$ is independent of $\bigcup_{a\in\mathcal{A}}S_{a,t-1}$, one deduces that with probability higher than $1-2\delta$,
\[
\Delta_t\,\mathbb{I}\{I_t=\lnot\star_t\}\le\Big[\sum_{a'\in\{\star_t,\lnot\star_t\}}\|X_t\|_{V_{a',t-1}^{-1}}B_{a',t-1}\Big]\,\mathbb{I}\{I_t=\lnot\star_t\}\,, \quad (7.10)
\]
where the short-hand notations $V_{a,t-1}\stackrel{\text{def}}{=}V_{\lambda_{t-1}}(S_{a,t-1}(I^a_{t-1}))$ as well as $B_{a,t-1}\stackrel{\text{def}}{=}B_{\lambda_{t-1},a}(S_{a,t-1}(I^a_{t-1}))$ are introduced for convenience.
Step 2: Now, the term $\|X_t\|_{V_{a,t-1}^{-1}}$ appearing in Equation (7.10) is bounded. By definition of $V_{a,t-1}$, it holds that
\[
\|X_t\|^2_{V_{a,t-1}^{-1}}=X_t^\top\Big(\lambda_{t-1}I_d+\sum_{i\in I^a_{t-1}}X_{s_i}X_{s_i}^\top\Big)^{-1}X_t\,.
\]
This expression is decomposed by using the definition $X_s=\mu+\xi_s$ for all $s$. Thus, on the one hand,
\[
\|X_t\|_{V_{a,t-1}^{-1}}\le\|\mu\|_{V_{a,t-1}^{-1}}+\|\xi_t\|_2\,\lambda_{t-1}^{-1/2}\,, \quad (7.11)
\]
where one used the fact that the minimum eigenvalue of $V_{a,t-1}$ is lower-bounded by $\lambda_{t-1}$. On the other hand, the following decomposition holds:
\[
V_{a,t-1}=\lambda_{t-1}I_d+|I^a_{t-1}|\,\mu\mu^\top+\sum_{i\in I^a_{t-1}}\xi_{s_i}\xi_{s_i}^\top+\Big(\sum_{i\in I^a_{t-1}}\xi_{s_i}\Big)\mu^\top+\mu\Big(\sum_{i\in I^a_{t-1}}\xi_{s_i}^\top\Big)=V+E+E_1+E_2\,,
\]
where the following four matrices are introduced:
\[
V=\lambda_{t-1}I_d+|I^a_{t-1}|\,\mu\mu^\top\,,\qquad
E=\sum_{i\in I^a_{t-1}}\xi_{s_i}\xi_{s_i}^\top\,,\qquad
E_1=\mu\Big(\sum_{i\in I^a_{t-1}}\xi_{s_i}^\top\Big)
\quad\text{and}\quad
E_2=\Big(\sum_{i\in I^a_{t-1}}\xi_{s_i}\Big)\mu^\top\,.
\]
Now, $\mu$ is an eigenvector of the matrix $V$ with associated eigenvalue $\lambda_\mu\stackrel{\text{def}}{=}\lambda_{t-1}+|I^a_{t-1}|\,\|\mu\|_2^2$. Thus, it holds that $\mu^\top V^{-1}\mu=\mu^\top V^{-1}\frac{V\mu}{\lambda_\mu}=\frac{\|\mu\|_2^2}{\lambda_\mu}$. The minimum eigenvalue of $E$ is non-negative. $\mu$ is also an eigenvector of the rank-1 matrix $E_1$, with eigenvalue $\lambda_{\mu,2}=\sum_{i\in I^a_{t-1}}\langle\xi_{s_i},\mu\rangle$. Finally, the only non-zero eigenvalue of the rank-1 matrix $E_2$ is $\sum_{i\in I^a_{t-1}}\langle\xi_{s_i},\mu\rangle$ (associated to the vector $\sum_{i\in I^a_{t-1}}\xi_{s_i}$).
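The eigenvalue identity for the unperturbed matrix $V$ can be checked numerically; the sketch below uses synthetic values for the dimension, the index-set size, and the regularization, all of which are illustrative assumptions.

```python
import numpy as np

# Minimal numerical sanity check (synthetic values) of the eigenvalue fact used
# in Step 2: mu is an eigenvector of V = lambda*I_d + n*mu mu^T with eigenvalue
# lambda_mu = lambda + n*||mu||_2^2, hence mu^T V^{-1} mu = ||mu||_2^2 / lambda_mu.
rng = np.random.default_rng(1)
d, n, lam = 4, 50, 5.0
mu = rng.normal(size=d)
V = lam * np.eye(d) + n * np.outer(mu, mu)
lam_mu = lam + n * (mu @ mu)

assert np.allclose(V @ mu, lam_mu * mu)                          # eigenvector check
assert np.isclose(mu @ np.linalg.solve(V, mu), (mu @ mu) / lam_mu)
print("eigenvalue identity verified")
```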
Thus one deduces that the matrix norm of the vector $\mu$ cannot increase too much when $V$ is perturbed by $E+E_1+E_2$: $\lambda_\mu$ is shifted by at most $\lambda_{\mu,2}+\min\{\lambda_{\mu,2},0\}$, which leads to the bound
\[
\|\mu\|^2_{V_{a,t-1}^{-1}}\le\frac{\|\mu\|_2^2}{\lambda_{t-1}+|I^a_{t-1}|\,\|\mu\|_2^2+2\min\big\{\sum_{i\in I^a_{t-1}}\langle\xi_{s_i},\mu\rangle,\,0\big\}}\,,
\]
under the condition that $\lambda_{t-1}+|I^a_{t-1}|\,\|\mu\|_2^2+2\min\{\sum_{i\in I^a_{t-1}}\langle\xi_{s_i},\mu\rangle,0\}>0$.
This condition holds with high probability, provided that the noise is small enough. Indeed, by the Chernoff method together with (7.3), it holds for all deterministic sets $I$ of size $n$, and for $\delta\in(0,1)$, that
\[
\mathbb{P}\Big[\sum_{i\in I}\langle\xi_{s_i},\mu\rangle\le-\|\mu\|_2\,\sigma_X\sqrt{2n\log(1/\delta)}\Big]\le\delta\,.
\]
Thus, since $I^a_{t-1}$ is chosen independently of the samples, by a union bound over the possible values of the random size $|I^a_{t-1}|\le t-1$ of the index set, it follows that on an event of probability higher than $1-\delta$,
\[
\sum_{i\in I^a_{t-1}}\langle\xi_{s_i},\mu\rangle\ge-\|\mu\|_2\,\sigma_X\sqrt{2|I^a_{t-1}|\log((t-1)/\delta)}\,.
\]
Thus, solving the condition $n\|\mu\|_2^2+\lambda_{t-1}-2\|\mu\|_2\sigma_X\sqrt{2n\log((t-1)/\delta)}>0$ in $n$, one observes that when $\lambda_{t-1}>2\sigma_X^2\log((t-1)/\delta)$, the condition is satisfied for all $n$.
Step 3: Plugging this result and the bound on $\|\mu\|^2_{V_{a,t-1}^{-1}}$ into (7.11), and combining this together with (7.10), one deduces that at a time $t$ such that $I_t=\lnot\star_t$, with probability higher than $1-6\delta$,
\[
\Delta_t\le\sqrt{\frac{\|\mu\|_2^2\,B^2_{\lnot\star_t,t-1}}{\lambda_{t-1}+n_{t-1}\|\mu\|_2^2-2\|\mu\|_2\sigma_X\sqrt{2n_{t-1}\log(\frac{t-1}{\delta})}}}
+\sqrt{\frac{\|\mu\|_2^2\,B^2_{\star_t,t-1}}{\lambda_{t-1}+n_{t-1}\|\mu\|_2^2-2\|\mu\|_2\sigma_X\sqrt{2n_{t-1}\log(\frac{t-1}{\delta})}}}
+\|\xi_t\|_2\,\lambda_{t-1}^{-1/2}\big(B_{\lnot\star_t,t-1}+B_{\star_t,t-1}\big)\,, \quad (7.12)
\]
where $n_{t-1}\stackrel{\text{def}}{=}|I^{\lnot\star_t}_{t-1}|=|I^{\star_t}_{t-1}|=\min\{N_{t-1}(\lnot\star_t),N_{t-1}(\star_t)\}$.
The next step is to simplify this expression by upper bounding both $B_{\lnot\star_t,t-1}$ and $B_{\star_t,t-1}$. To this end, one notes that on the one hand $\lambda_{t-1}\|\theta_{\star_t}\|_{V_{\lnot\star_t,t-1}^{-1}}\le\lambda_{t-1}^{1/2}\|\theta_{\star_t}\|_2$, and on the other hand, using the fact that $\|X_t\|_2\le C$ for all context vectors $X_t$, where $C\stackrel{\text{def}}{=}\|\mu\|_2+\sigma_X$,
\[
\det(V_{\lnot\star_t,t-1})\le\Big(\frac{\operatorname{trace}(V_{\lnot\star_t,t-1})}{d}\Big)^d\le\Big(\frac{\lambda_{t-1}d+n_{t-1}C^2}{d}\Big)^d\le\big(\lambda_{t-1}+(t-1)C^2/d\big)^d\,.
\]
Thus, it holds for both $I'=\lnot\star_t$ and $I'=\star_t$ that
\[
B_{I',t-1}\le R_\eta\sqrt{2d\log\Big(\frac{\lambda_{t-1}^{1/2}+(t-1)C^2/(d\lambda_{t-1}^{1/2})}{\delta}\Big)}+\lambda_{t-1}^{1/2}\|\theta_{I'}\|_2\,. \quad (7.13)
\]
For convenience, the first term on the right-hand side of (7.13) is denoted
\[
b_{t-1}=R_\eta\sqrt{2d\log\Big(\frac{\lambda_{t-1}^{1/2}+(t-1)C^2/(d\lambda_{t-1}^{1/2})}{\delta}\Big)}\,.
\]
Combining (7.13) together with (7.12), and using the fact that $\|\theta\|_2\le B$ for all $\theta\in\Theta$, it has been shown so far that for all $t$ such that $I_t=\lnot\star_t$, with probability higher than $1-6\delta$,
\[
\Delta_t\le\bigg[\frac{2\|\mu\|_2}{\sqrt{\lambda_{t-1}+n_{t-1}\|\mu\|_2^2-2\|\mu\|_2\sigma_X\sqrt{2n_{t-1}\log(\frac{t-1}{\delta})}}}+\|\xi_t\|_2\,\lambda_{t-1}^{-1/2}\bigg]\Big[b_{t-1}+\lambda_{t-1}^{1/2}B\Big]\,,
\]
that is, after reorganizing the terms, and provided that the noise is not too large, i.e. $\Delta_t\ge\|\xi_t\|_2(b_{t-1}\lambda_{t-1}^{-1/2}+B)$,
\[
n_{t-1}-\frac{2\sigma_X}{\|\mu\|_2}\sqrt{2n_{t-1}\log\Big(\frac{t-1}{\delta}\Big)}\le 4\bigg(\frac{b_{t-1}+\lambda_{t-1}^{1/2}B}{\Delta_t-\|\xi_t\|_2(b_{t-1}\lambda_{t-1}^{-1/2}+B)}\bigg)^2-\frac{\lambda_{t-1}}{\|\mu\|_2^2}\,. \quad (7.14)
\]
By introducing the threshold $\tau_t=\|\xi_t\|_2(b_{t-1}\lambda_{t-1}^{-1/2}+B)$, the regret is decomposed into:
\[
R_{X,T}\le\sum_{t=1}^{T}\Delta_t\,\mathbb{I}\{I_t=\lnot\star_t,\ \Delta_t\ge\tau_t\}+\sum_{t=1}^{T}\Delta_t\,\mathbb{I}\{\Delta_t<\tau_t\}\,.
\]
Step 4: In this step, two cases are considered: first, the case when $N_{t-1}(\lnot\star_t)\ge N_{t-1}(\star_t)$; second, the case when $N_{t-1}(\lnot\star_t)<N_{t-1}(\star_t)$.

In the first situation, $n_{t-1}=N_{t-1}(\star_t)$, and it follows from (7.14) that $N_{t-1}(\star_t)$ cannot be too large. Indeed, for positive constants $A,B_0$, the condition $n-A\sqrt{n}\le B_0$ implies that $n\le B_0+\frac{A^2}{2}\big(1+\sqrt{1+4B_0/A^2}\big)$, which in this case leads to $N_{t-1}(\star_t)\le u_{\lambda,t}(\delta)$, where
\[
u_{\lambda,t}(\delta)\stackrel{\text{def}}{=}4\bigg(\frac{b_{t-1}+\lambda_{t-1}^{1/2}B}{\Delta_t-\|\xi_t\|_2(b_{t-1}\lambda_{t-1}^{-1/2}+B)}\bigg)^2+\frac{4\sigma_X^2\log(\frac{t-1}{\delta})}{\|\mu\|_2^2}+\frac{4\sigma_X\sqrt{2\log(\frac{t-1}{\delta})}}{\|\mu\|_2}\bigg(\frac{b_{t-1}+\lambda_{t-1}^{1/2}B}{\Delta_t-\|\xi_t\|_2(b_{t-1}\lambda_{t-1}^{-1/2}+B)}\bigg)-\frac{\lambda_{t-1}}{\|\mu\|_2^2}\,.
\]
Likewise, in the second case, when $N_{t-1}(\lnot\star_t)<N_{t-1}(\star_t)$, one deduces that necessarily $N_{t-1}(\lnot\star_t)\le u_{\lambda,t}(\delta)$ with high probability, and thus the regret is again controlled.
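The elementary inversion used above (for positive constants, $n-A\sqrt{n}\le B_0$ forces $n$ below an explicit threshold) can be checked numerically; the constants in the sketch are illustrative.

```python
import math

# Numerical sanity check (illustrative constants) of the elementary inversion
# used in Step 4: for positive A, B0, the condition n - A*sqrt(n) <= B0 implies
# n <= B0 + (A**2/2) * (1 + sqrt(1 + 4*B0/A**2)), and this bound is tight.
def upper_bound(A, B0):
    return B0 + (A ** 2 / 2) * (1 + math.sqrt(1 + 4 * B0 / A ** 2))

for A in (0.5, 2.0, 10.0):
    for B0 in (0.1, 1.0, 25.0):
        u = upper_bound(A, B0)
        assert u - A * math.sqrt(u) <= B0 + 1e-9   # u itself satisfies the premise
        n_bad = u + 1e-6                            # just above u the premise fails
        assert n_bad - A * math.sqrt(n_bad) > B0
print("inversion bound verified")
```

The bound is exactly the largest root of $n-A\sqrt{n}=B_0$, which is why the check just above the threshold fails as expected.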
From this point on, one can proceed similarly to the elementary proofs of the regret of the UCB algorithm (Auer et al., 2002), as in (Maillard et al., 2011; Bubeck, 2010), or of BESA (Baransi et al., 2014). More precisely, since $|\Delta_t|\le 1$ by assumption, it holds that
\[
R_{X,T}\le\sum_{t=1}^{T}\Delta_t\,\mathbb{I}\{I_t=\lnot\star_t\cap\Delta_t\ge\tau_t\cap N_{t-1}(\star_t)>u_{\lambda,t}(\delta_t)\}+\sum_{t=1}^{T}\mathbb{I}\{N_{t-1}(\star_t)\le u_{\lambda,t}(\delta_t)\}+\sum_{t=1}^{T}\Delta_t\,\mathbb{I}\{\Delta_t<\tau_t\}
\]
\[
\le\sum_{t=1}^{T}\Delta_t\,\mathbb{I}\{I_t=\lnot\star_t\cap\Delta_t\ge\tau_t\cap N_{t-1}(\lnot\star_t)\le u_{\lambda,t}(\delta_t)\}+6\sum_{t=1}^{T}\delta_t+\sum_{t=1}^{T}\mathbb{I}\{N_{t-1}(\star_t)\le u_{\lambda,t}(\delta_t)\}+\sum_{t=1}^{T}\Delta_t\,\mathbb{I}\{\Delta_t<\tau_t\}\,,
\]
for any choice of $\delta_t\in(0,1)$ for $t\in\{1,\dots,T\}$. In particular, this holds for the choice $\delta_t=t^{-2}$.
The first sum is split into the sum over the time steps for which $\star_t=a$ and the sum over those for which $\star_t=b$. Note that $\Delta_t$ is not independent of $I_t$. For the sum over the steps such that $\star_t=b$, it holds that
\[
\sum_{t=1}^{T}\Delta_t\,\mathbb{I}\{I_t=a\cap N_{t-1}(a)\le u_{\lambda,t}(\delta_t)\}\,\mathbb{I}\{\star_t=b\}
\le\big(\max_{t\le T}\Delta_t\big)\sum_{t=1}^{T}\mathbb{I}\{I_t=a\cap N_{t-1}(a)\le\max_{t\le T}u_{\lambda,t}(\delta_t)\}\,\mathbb{I}\{\star_t=b\}
\le\big(\max_{t\le T}\Delta_t\big)\big(\max_{t\le T}u_{\lambda,t}(\delta_t)\big)\,.
\]
Likewise, a similar control can be obtained for the sum corresponding to $\star_t=a$.
In order to control the maximum terms, one uses the fact that $\|\xi_t\|_2^2\le\sigma_X^2$. Thus, it holds that $\max_{t\le T}u_{\lambda,t}(\delta_t)\le u$, where
\[
u\stackrel{\text{def}}{=}16\Bigg[\frac{R_\eta\sqrt{2d\log\Big(\frac{\lambda_T^{1/2}T^2+T^3C^2}{d\lambda_T^{1/2}}\Big)}+\lambda_T^{1/2}B}{\min_{t\in[T]}\Delta_t}+\frac{\sigma_X\sqrt{6\log(T)}}{\|\mu\|_2}\Bigg]^2-\frac{\lambda_T}{\|\mu\|_2^2}\,,
\]
provided that the context noise is small enough that
\[
\min_{t\in[T]}\Delta_t>\tau\stackrel{\text{def}}{=}2\sigma_X\bigg[R_\eta\sqrt{\frac{2d}{\lambda_T}\log\Big(\frac{\lambda_T^{1/2}T^2+T^3C^2}{d\lambda_T^{1/2}}\Big)}+B\bigg]\,.
\]
To sum up this step, it has been shown so far that $R_{X,T}$ is bounded as
\[
R_{X,T}\le 2\big(\max_{t\le T}\Delta_t\big)u+\pi^2+\sum_{t=1}^{T}\mathbb{I}\{N_{t-1}(\star_t)\le u_{\lambda,t}(t^{-2})\}+\sum_{t=1}^{T}\Delta_t\,\mathbb{I}\Big\{\min_{t\in[T]}\Delta_t\le\tau\Big\}\,.
\]
Step 5: In order to conclude this proof, the term that now needs to be controlled is
\[
\sum_{t=1}^{T}\mathbb{I}\{N_{t-1}(\star_t)\le u_{\lambda,t}(\delta_t)\}=\sum_{t\in\mathcal{T}_a}\mathbb{I}\{N_{t-1}(a)\le u_{\lambda,t}(\delta_t)\}+\sum_{t\in\mathcal{T}_b}\mathbb{I}\{N_{t-1}(b)\le u_{\lambda,t}(\delta_t)\}\,,
\]
where $\mathcal{T}_a\stackrel{\text{def}}{=}\{t\in[T]:\star_t=a\}$ for $a\in\mathcal{A}$.
To this end, a procedure similar to that used in (Baransi et al., 2014) is employed. More precisely, following the exact same steps as Steps 3 and 4 of the proof of Theorem 1 in (Baransi et al., 2014), it follows that
\[
\sum_{t\in\mathcal{T}_b}\mathbb{P}\big[N_{t-1}(b)\le u_{\lambda,t}(\delta_t)\big]\le c+\sum_{t\in\mathcal{T}_b,\,t\ge c}\ \sum_{j=1}^{\lfloor u_{\lambda,t}(\delta_t)\rfloor}\alpha_{b,a}(M_t,j)+O(1)\,,
\]
where $c$ is a constant such that $t\ge c$ implies $t\ge u_{\lambda,t}(\delta_t)(u_{\lambda,t}(\delta_t)+1)$, $M_t\in\mathbb{N}$ is such that $M_t=O(\log(t))$, and where the function $\alpha_{b,a}(M,j)$ is defined by¹
\[
\alpha_{b,a}(M,j)=\mathbb{E}_{Z^b_{1:j}}\Big[\mathbb{P}_{Z^a_{1:j},X}\Big(\langle X,\hat\theta(Z^a_{1:j})-\hat\theta(Z^b_{1:j})\rangle>0\Big)^M\Big]\,.
\]
Here, one used explicitly the stochastic nature of the context $X_t$, to avoid having to deal with much more complex expressions. This comes at the price of restricting to cases when the noise is not too strong. Here, $Z^b_{1:j}$ denotes a set of $j$ i.i.d. samples $Z^b_j=(X_j,Y_j)$ generated from the model considered in the introduction when arm $b$ is chosen (that is, such that $Y_j=\langle X_j,\theta_b\rangle+\eta_j$ and $X_j=\mu+\xi_j$), $Z^a_{1:j}$ denotes a similar set built using $\theta_a$ instead of $\theta_b$, and $X=\mu+\xi$ is generated by Nature.

¹Indeed, the tie event $\langle X,\hat\theta(Z^a_{1:j})-\hat\theta(Z^b_{1:j})\rangle=0$ has probability 0, since the distributions are diffuse.
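The quantity $\alpha_{b,a}(M,j)$ can be approximated by a simple nested Monte-Carlo loop: an outer expectation over $Z^b_{1:j}$ and an inner probability over fresh $Z^a_{1:j}$ and a fresh context $X$. The sketch below uses a plain ridge estimator as a stand-in for $\hat\theta_\lambda$, and all parameters (dimension, gap, noise levels, sample sizes) are illustrative assumptions, not the CL-BESA pipeline.

```python
import numpy as np

# Hedged Monte-Carlo sketch of alpha_{b,a}(M, j): an outer expectation over
# Z^b_{1:j}, an inner probability over fresh Z^a_{1:j} and a fresh context X.
# A plain ridge estimator stands in for theta_hat; all parameters are
# illustrative assumptions.
rng = np.random.default_rng(2)
d, j, M, lam, sigma_x, R = 3, 10, 5, 1.0, 0.2, 0.1
mu = np.ones(d)
theta_a, theta_b = np.zeros(d), 0.5 * np.ones(d)   # arm b is the better arm

def theta_hat(theta, n):
    """Ridge estimate from n fresh samples of the arm with parameter theta."""
    X = mu + sigma_x * rng.normal(size=(n, d))
    y = X @ theta + R * rng.normal(size=n)
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def alpha(n_outer=100, n_inner=100):
    total = 0.0
    for _ in range(n_outer):
        tb = theta_hat(theta_b, j)                 # Z^b_{1:j} held fixed inside
        wins = sum(
            (mu + sigma_x * rng.normal(size=d)) @ (theta_hat(theta_a, j) - tb) > 0
            for _ in range(n_inner)
        )
        total += (wins / n_inner) ** M
    return total / n_outer

est = alpha()
print(est)  # expected to be small: arm a rarely beats arm b M times in a row
```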
Step 6: The next step of the proof is to control the quantity $\alpha_{b,a}(M,j)$ for the steps in $\mathcal{T}_b$ (and likewise $\alpha_{a,b}(M,j)$ for the steps in $\mathcal{T}_a$). In the sequel, the notation $b=\star$ is used to clarify that $b$ is the optimal arm at the time steps in $\mathcal{T}_b$. One wants to show that this quantity decays exponentially fast to 0 with either $M$ or $j$, so that its contribution to the regret is controlled. To begin with, the dot product is decomposed according to the different random variables:
\[
\langle X,\hat\theta(Z^a_{1:j})-\hat\theta(Z^\star_{1:j})\rangle=\langle\mu,\theta_\star-\hat\theta(Z^\star_{1:j})\rangle+\langle\mu,\hat\theta(Z^a_{1:j})-\theta_a\rangle+\langle\xi,\hat\theta(Z^a_{1:j})-\theta_a\rangle+\langle\xi,\theta_\star-\hat\theta(Z^\star_{1:j})\rangle+\langle\xi,\theta_a-\theta_\star\rangle-\Delta\,,
\]
where $\Delta=\langle\mu,\theta_\star-\theta_a\rangle$. Then, it holds for all $\varepsilon>0$ that
\[
\mathbb{P}_X\Big(\langle\xi,\theta_a-\theta_\star\rangle\ge\varepsilon\Big)\le\exp\Big(-\frac{\varepsilon^2}{2\|\theta_a-\theta_\star\|_2^2\,\sigma_X^2}\Big)\,. \quad (7.15)
\]
The term $\langle\xi,\theta_\star-\hat\theta(Z^\star_{1:j})\rangle$ is controlled a bit differently, by
\[
\mathbb{P}_X\Big(\langle\xi,\theta_\star-\hat\theta(Z^\star_{1:j})\rangle\ge\varepsilon\Big)\le\exp\Big(-\frac{\varepsilon^2\lambda_{t_j}}{2\|\theta_\star-\hat\theta(Z^\star_{1:j})\|^2_{V_{\star,t_j}}\sigma_X^2}\Big)\,, \quad (7.16)
\]
where $t_j\ge j$ corresponds to a time when arm $a$ is sampled at least $j$ times (note that this is a probability with respect to $X$, not $Z^\star_{1:j}$).
Likewise, $\langle\xi,\hat\theta(Z^a_{1:j})-\theta_a\rangle$ is controlled by
\[
\mathbb{P}_{Z^a_{1:j},X}\Big(\langle\xi,\hat\theta(Z^a_{1:j})-\theta_a\rangle\ge\varepsilon\Big)\le\Big(\lambda_{t_j}^{1/2}+\frac{jC^2}{d\lambda_{t_j}^{1/2}}\Big)\exp\Big(-\frac{\lambda_{t_j}(\varepsilon-\sigma_X\|\theta_\star\|_2)^2}{2\sigma_X^2R_\eta^2}\Big)\,, \quad (7.17)
\]
for all $\varepsilon>\sigma_X\|\theta_\star\|_2$.
Finally, it has already been shown that, with probability higher than $1-\delta-\delta'$ with respect to $Z^a_{1:j}$,
\[
\langle\mu,\theta_a-\hat\theta(Z^a_{1:j})\rangle\le\frac{\|\mu\|_2R_\eta\sqrt{2\log\Big(\frac{\lambda_{t_j}^{1/2}+jC^2/(d\lambda_{t_j}^{1/2})}{\delta}\Big)}+\lambda_{t_j}^{1/2}\|\mu\|_2\|\theta_a\|_2}{\sqrt{\lambda_{t_j}+j\|\mu\|_2^2-2\|\mu\|_2\sigma_X\sqrt{2j\log(1/\delta')}}}\,. \quad (7.18)
\]
Inverting this bound in $\delta$, this gives, for all $\varepsilon\ge\frac{\|\mu\|_2}{\sqrt{j\|\mu\|_2^2+\lambda_{t_j}}}$ and $\varepsilon'\ge\lambda_{t_j}^{1/2}\|\theta_a\|_2$, that
\[
\mathbb{P}_{Z^a_{1:j}}\Big(\langle\mu,\theta_a-\hat\theta(Z^a_{1:j})\rangle\ge\varepsilon\varepsilon'\Big)\le\exp\Big(-\frac{\big(j\|\mu\|_2^2+\lambda_{t_j}-\|\mu\|_2^2/\varepsilon^2\big)^2}{8\|\mu\|_2^2\sigma_X^2 j}\Big)+\Big(\lambda_{t_j}^{1/2}+\frac{jC^2}{d\lambda_{t_j}^{1/2}}\Big)\exp\Big(-\frac{\big(\varepsilon'-\lambda_{t_j}^{1/2}\|\theta_a\|_2\big)^2}{2R_\eta^2}\Big)\,. \quad (7.19)
\]
Thus, by combining equations (7.15), (7.16), (7.17) and (7.19) together, it follows that
\[
\mathbb{P}_{Z^a_{1:j},X}\Big(\langle X,\hat\theta(Z^a_{1:j})-\hat\theta(Z^\star_{1:j})\rangle>0\Big)\le\mathbb{P}_{Z^a_{1:j},X}\Big(\langle\mu,\hat\theta(Z^a_{1:j})-\theta_a\rangle+\langle\xi,\hat\theta(Z^a_{1:j})-\theta_a\rangle+\langle\xi,\theta_\star-\hat\theta(Z^\star_{1:j})\rangle+\langle\xi,\theta_a-\theta_\star\rangle>\Delta-\langle\mu,\theta_\star-\hat\theta(Z^\star_{1:j})\rangle\Big)
\]
\[
\le\inf_{\varepsilon_0,\dots,\varepsilon_4}\bigg\{\exp\Big(-\frac{\big(j\|\mu\|_2^2+\lambda_{t_j}-\|\mu\|_2^2/\varepsilon_0^2\big)^2}{8\|\mu\|_2^2\sigma_X^2 j}\Big)+\Big(\lambda_{t_j}^{1/2}+\frac{jC^2}{d\lambda_{t_j}^{1/2}}\Big)\exp\Big(-\frac{\big(\varepsilon_1-\lambda_{t_j}^{1/2}\|\theta_a\|_2\big)^2}{2R_\eta^2}\Big)+\Big(\lambda_{t_j}^{1/2}+\frac{jC^2}{d\lambda_{t_j}^{1/2}}\Big)\exp\Big(-\frac{\lambda_{t_j}(\varepsilon_2-\sigma_X\|\theta_\star\|_2)^2}{2\sigma_X^2R_\eta^2}\Big)+\exp\Big(-\frac{\varepsilon_3^2\lambda_{t_j}}{2\|\theta_\star-\hat\theta(Z^\star_{1:j})\|^2_{V_{\star,t_j}}\sigma_X^2}\Big)+\exp\Big(-\frac{\varepsilon_4^2}{2\|\theta_a-\theta_\star\|_2^2\sigma_X^2}\Big)
\]
\[
:\ \varepsilon_0\varepsilon_1+\varepsilon_2+\varepsilon_3+\varepsilon_4\le\Delta-\langle\mu,\theta_\star-\hat\theta(Z^\star_{1:j})\rangle,\ \varepsilon_2>\sigma_X\|\theta_\star\|_2,\ \varepsilon_0\ge\frac{\|\mu\|_2}{\sqrt{j\|\mu\|_2^2+\lambda_{t_j}}},\ \varepsilon_1\ge\lambda_{t_j}^{1/2}\|\theta_a\|_2\bigg\}\,.
\]
By choosing $\varepsilon_0=\frac{\sqrt{2}\,\|\mu\|_2}{\sqrt{j\|\mu\|_2^2+\lambda_{t_j}}}$, $\varepsilon_1=2\lambda_{t_j}^{1/2}\|\theta_a\|_2$, $\varepsilon_2=2\sigma_X\|\theta_\star\|_2$, $\varepsilon_3=\sqrt{2}\,\sigma_X\|\theta_\star\|_2$, and $\varepsilon_4=\sqrt{2}\,\sigma_X\|\theta_a-\theta_\star\|_2\,\kappa$ for some positive $\kappa$, the previous expression can be simplified. This way, and using the fact that $\lambda_T\ge\lambda_{t_j}\ge\lambda_j$, one finally establishes the following bound
on the quantity $\alpha_{\star,a}(M_t,j)$ that has to be controlled:
\[
\alpha_{\star,a}(M_t,j)\le\mathbb{E}_{Z^\star_{1:j}}\Bigg[\bigg(e^{-\kappa^2}+e^{-\frac{(j\|\mu\|_2^2+\lambda_j)^2}{32\|\mu\|_2^2\sigma_X^2 j}}+\Big(\lambda_T^{1/2}+\frac{jC^2}{d\lambda_j^{1/2}}\Big)\Big[e^{-\frac{\lambda_j\|\theta_a\|_2^2}{2R_\eta^2}}+e^{-\frac{\lambda_j\|\theta_\star\|_2^2}{2R_\eta^2}}\Big]+e^{-\frac{\lambda_j\|\theta_\star\|_2^2}{\|\theta_\star-\hat\theta(Z^\star_{1:j})\|^2_{V_{\star,t_j}}}}\bigg)^{M_t}\mathbb{I}\{\mathcal{E}\}+\mathbb{I}\{\mathcal{E}^c\}\Bigg]\,. \quad (7.20)
\]
For convenience, the event $\mathcal{E}$ has been introduced:
\[
\mathcal{E}\stackrel{\text{def}}{=}\bigg\{\langle\mu,\theta_\star-\hat\theta(Z^\star_{1:j})\rangle\le\Delta-\frac{2\sqrt{2}\,\|\mu\|_2\|\theta_a\|_2}{\sqrt{j\|\mu\|_2^2/\lambda_{t_j}+1}}-\sqrt{2}\,\sigma_X B\big(1+\sqrt{2}+2\kappa\big)\bigg\}\,.
\]
At this point, one notes that since $\lambda_j\ge 6\sigma_X^2\log(j)$, all the exponential terms but $e^{-\kappa^2}$ decay polynomially fast to 0 with $j$, and that for $M_t=O(\log(t))$, the first half of the bound on $\alpha_{\star,a}(M_t,j)$ decays polynomially fast to 0 with $t$, at a rate that can always be adjusted to be $t^{-(1+\beta)}$ for some small $\beta>0$. Thus, in order to control the regret term of Step 5, one deduces from this observation and (7.20) that it only remains to show that $\mathbb{P}_{Z^\star_{1:j}}(\mathcal{E}^c)$ is small enough under our assumptions on the noise. From equation (7.19), it is not difficult to see that
\[
\mathbb{P}_{Z^\star_{1:j}}(\mathcal{E}^c)\le e^{-\frac{(j\|\mu\|_2^2+\lambda_j)^2}{32\|\mu\|_2^2\sigma_X^2 j}}+\Big(\lambda_T^{1/2}+\frac{jC^2}{d\lambda_j^{1/2}}\Big)e^{-\frac{\lambda_j\|\theta_\star\|_2^2}{2R_\eta^2}}\,,
\]
provided that the following condition holds:
\[
\frac{2\sqrt{2}\big(\|\theta_a\|_2+\|\theta_\star\|_2\big)}{\sqrt{j/\lambda_T+\|\mu\|_2^{-2}}}\le\Delta-\sqrt{2}\,\sigma_X B\big(1+\sqrt{2}+2\kappa\big)\,.
\]
Finally, in order for the term $\mathbb{P}_{Z^\star_{1:j}}(\mathcal{E}^c)$ to decay fast enough with $t$ (at least as $t^{-(1+\beta)}$), it is enough to choose $\lambda_t\ge c\log(T)$ for all $t$, for some constant $c$, which leads to
\[
\sum_{t\in\mathcal{T}_b,\,t\ge c_\Delta}\ \sum_{j=1}^{\lfloor u_{\lambda,t}(\delta_t)\rfloor}\alpha_{b,a}(M_t,j)=O(1)\,.
\]
Step 7: Now that $\alpha_{\star,a}(M_t,j)$ is controlled, $\alpha_{a,b}(M_t,j)$ can be treated similarly. Thus, at the price of losing a factor 2 (using $|\mathcal{T}_a|\le T$ and $|\mathcal{T}_b|\le T$), it has been shown that
\[
\mathbb{E}[R_{X,T}]\le\big(\max_{t\in[T]}\Delta_t\big)U+\sum_{t=1}^{T}\Delta_t\,\mathbb{I}\Big\{\min_{t\in[T]}\Delta_t\le\tau\Big\}+O(1)\,,
\]
where
\[
U\stackrel{\text{def}}{=}\frac{64}{\min_{t\in[T]}\Delta_t^2}\bigg[R_\eta\sqrt{2d\log\Big(\frac{\lambda_T^{1/2}T^2+T^3C^2}{d\lambda_T^{1/2}}\Big)}+\lambda_T^{1/2}B\bigg]^2+\frac{24\sigma_X^2\log(T)-2\lambda_T}{\|\mu\|_2^2}
\]
and
\[
\tau\stackrel{\text{def}}{=}2\sigma_X\bigg[R_\eta\sqrt{\frac{2d}{\lambda_T}\log\Big(\frac{\lambda_T^{1/2}T^2+T^3C^2}{d\lambda_T^{1/2}}\Big)}+B\bigg]\,.
\]
This concludes the proof, after some cosmetic simplifications of $U$ using $(a+b)^2\le 2(a^2+b^2)$ for positive $a,b$.