

6.6.3 Experimental validation

on the regret. In practice, however, it will be seen that β = 0 is a recommended value.

This empirical finding is in agreement with the fact that risk-aversion is better served by a conservative algorithmic behavior.


The exploration coefficient C is set to two distinct values, C = 3 and C = 10^{-4}.
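For illustration only, the sketch below shows a generic UCB-style index computed on the empirical CVaR, which is where an exploration coefficient of this kind enters; it is not the exact MARABOUT selection rule (defined earlier in this chapter), and the function names are assumptions made for this example.

```python
import numpy as np

def empirical_cvar(rewards, alpha):
    """Empirical CVaR at level alpha: mean of the alpha-fraction worst rewards.
    At least one sample is kept, so for alpha < 1/n it is the minimum reward."""
    rewards = np.sort(np.asarray(rewards, dtype=float))
    k = max(1, int(np.ceil(alpha * len(rewards))))
    return rewards[:k].mean()

def cvar_ucb_index(rewards, alpha, t, C):
    """Generic optimistic index on the empirical CVaR (illustrative only):
    empirical CVaR of the arm plus an exploration bonus scaled by C."""
    return empirical_cvar(rewards, alpha) + np.sqrt(C * np.log(t) / len(rewards))

# With a very small C (e.g. 1e-4) the bonus is nearly zero, so the choice is
# driven almost entirely by the empirical CVaR: a conservative behavior.
```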

The CVaR regret, averaged over 100 independent runs, is displayed in Fig. 6.7.
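As a reading aid, the pseudo-regret plotted in Fig. 6.7 can be read as follows; this is an assumed form, consistent with the per-arm margins ∆mCVaR used in the discussion below, and the exact definition is the one given earlier in the chapter.

```latex
% Assumed form of the mCVaR pseudo-regret (sketch, not necessarily the exact
% definition used in this chapter): the optimal arm maximizes mCVaR_alpha.
\[
  \overline{R}_T
  \;=\; \sum_{t=1}^{T} \Bigl( \mathrm{mCVaR}_{\alpha}^{*} - \mathrm{mCVaR}_{\alpha, i_t} \Bigr)
  \;=\; \sum_{i=1}^{K} \Delta_{\mathrm{mCVaR},\,i}\, N_i(T),
  \qquad
  \Delta_{\mathrm{mCVaR},\,i} \;=\; \mathrm{mCVaR}_{\alpha}^{*} - \mathrm{mCVaR}_{\alpha,i},
\]
% where i_t is the arm selected at time t and N_i(T) the number of pulls of arm i.
```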

[Figure 6.7 about here: mCVaR pseudo-regret vs. iteration, comparing MARABOUT with C = 3 and C = 10^{-4}. Panels: (a) Problem 1, α = 0.5, T = 200; (b) Problem 1, α = 0.5, T = 2000; (c) Problem 2, α = 0.01, T = 2000; (d) Problem 3, α = 0.01, T = 2000.]

Figure 6.7: mCVaR pseudo-regret of MARABOUT averaged over 100 runs, on three 2-armed artificial MAB problems (see text) with C ∈ {10^{-4}, 3} and β = 0. Top row: problem 1 for time horizons T = 200 (left) and T = 2000 (right). Bottom row: problem 2 (left) and problem 3 (right).

The main lesson learned from the experimental results depicted in Fig. 6.7 is that MARABOUT can achieve a logarithmic regret on all three problems, albeit with a small value of C (C = 10^{-4}). This suggests that when dealing with such short time horizons, the exploration strength must be limited, and that the theoretical lower bound C > 2 might be too large in practice. In more detail:

• The first problem corresponds to an easy setting, with a large mCVaR margin (∆mCVaR = 0.5) and a high risk level α = 0.5, enabling the use of a fair amount of information. In this case, the mCVaR pseudo-regret is logarithmic even for high values of C (C = 3), provided that the time horizon is sufficiently large (Fig. 6.7, top row, right). On this problem, a standard MAB algorithm would select both arms equally often since the mean margin is 0, leading to a high variance of the rewards. In such situations, MARABOUT contributes to the stability of the gathered rewards.

• For the second problem (Fig. 6.7(c)), the main challenge lies in the tiny value of α = 0.01. Achieving a logarithmic regret rate in this challenging setting and over such short time horizons is an encouraging result for the applicability of the approach.

Such a setting is in fact ideally suited to risk-aware approaches. On the one hand, the low mean margin (∆ = 5·10^{-3}) adversely affects classical MAB algorithms. On the other hand, since ∆mCVaR is large, the loss incurred in the worst cases is large, which is precisely what one wants to be protected from.

Meanwhile, the large mCVaR margin makes the problem easier for MARABOUT. Furthermore, the above argument does not depend on the value of α. The fact that logarithmic regrets are attained for low α values thus also suggests a wide range of successful applications of the approach.

Note that the standard deviation is larger than on problem 1; a tentative interpretation is that the estimate of mCVaR_{α,i} is very sensitive to the first samples (for such a low α, the early estimate is essentially the worst reward observed so far), so many trials might be required to revise this too optimistic estimate; see the sketch after this list.

• On the third problem, the most challenging of the three, MARABOUT also manages to reach a logarithmic regret. The same remarks as above hold, and the higher variance of the results is explained by the smaller ∆mCVaR.

It must however be said that a practitioner might prefer a standard MAB algorithm in such a setting, since the losses in the worst cases remain small (∆mCVaR = 10^{-3}) and optimistic MAB algorithms might make a more efficient use of all the samples to compensate for the smaller margin ∆.
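To make the margin discussion above concrete, the following sketch builds two hypothetical arms in the spirit of problem 2 (the reward distributions and numbers are assumptions for this illustration, not the actual problem definition). It shows how a tiny mean margin can coexist with a large mCVaR margin, and why the early CVaR estimate is fragile for low α.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_cvar(x, alpha):
    """Mean of the alpha-fraction worst samples (at least one sample kept)."""
    x = np.sort(np.asarray(x, dtype=float))
    return x[: max(1, int(np.ceil(alpha * len(x))))].mean()

# Two hypothetical arms: nearly identical means, but arm 1 produces a
# catastrophic reward about 1% of the time (heavy lower tail).
n = 100_000
arm0 = rng.normal(loc=0.500, scale=0.01, size=n)
arm1 = np.where(rng.random(n) < 0.01,            # ~1% catastrophic outcomes
                rng.normal(0.0, 0.01, n),
                rng.normal(0.505, 0.01, n))

alpha = 0.01
print("mean margin :", abs(arm0.mean() - arm1.mean()))                 # tiny
print("mCVaR margin:", abs(empirical_cvar(arm0, alpha)
                           - empirical_cvar(arm1, alpha)))             # large

# With alpha = 0.01 and fewer than 100 samples, the empirical CVaR is just the
# minimum observed reward: typically no catastrophe has been seen yet, so the
# estimate is too optimistic and needs many trials to be revised downwards.
print("CVaR of arm 1 from its first 20 samples:",
      empirical_cvar(arm1[:20], alpha))
```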

Comparison with MARAB

The purpose of this section is to compare the behavior of MARAB and MARABOUT. Figure 6.8 presents the regret after T = 2000 = 100K (left) and T = 4000 = 200K (right) iterations. For a parameter C one order of magnitude lower than for MARAB, MARABOUT reaches a comparable level of performance on the whole collection of 1,000 artificial problems. In the same manner, Figure 6.9 compares the average sorted rewards of MARAB and MARABOUT on the problems with the smallest (left) and largest (right) variance of the optimal arm. As previously, a larger sensitivity to the parameter tuning is observed in the low-variance case. However, MARABOUT is able in both cases to reach the performance of MARAB for α = 20%. Finally, Figure 6.10 considers the real-world energy problem and shows the ability of the approach to gather good rewards in the 37.5% worst cases when α is low (left). Fig. 6.10 also shows that, for a given C value, the MARABOUT performance is not sensitive to α. Secondly, the MARABOUT performance is optimal for very low values of C (C = 10^{-7}). At the same time, MARABOUT is dominated by MARAB (which shows low sensitivity with respect to both α and C, Fig. 6.6). This fact is explained by the pessimistic and conservative strategy of MARAB; the lack of exploration does not harm its performance over such a short time horizon.
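As a reading aid for Figs. 6.8 to 6.10, the following sketch shows one plausible way such curves are built from raw results (the construction and variable shapes are assumptions for this illustration): final regrets are sorted independently per algorithm across problem instances (Fig. 6.8), while instantaneous rewards are sorted within each run and averaged rank-wise across runs (Figs. 6.9 and 6.10).

```python
import numpy as np

def sorted_regret_profile(final_regrets):
    """Fig. 6.8 style (assumed): sort the final cumulative regrets obtained on
    each problem instance, independently for each algorithm."""
    return np.sort(np.asarray(final_regrets, dtype=float))

def average_sorted_rewards(rewards, worst_fraction=1.0):
    """Figs. 6.9-6.10 style (assumed): sort the T instantaneous rewards within
    each run, average rank-wise over runs, and optionally keep only the worst
    fraction of ranks (e.g. 0.375 for the 37.5% worst cases)."""
    rewards = np.asarray(rewards, dtype=float)        # shape (n_runs, T)
    curve = np.sort(rewards, axis=1).mean(axis=0)     # ascending averaged curve
    return curve[: max(1, int(np.ceil(worst_fraction * curve.size)))]
```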

[Figure 6.8 about here: sorted empirical cumulative regret over 1,000 problem instances, comparing MARAB (C = 10^{-6}, α = 0.2) and MARABOUT (C = 10^{-7}, α = 0.2). Panels: (a) time horizon 2,000; (b) time horizon 4,000.]

Figure 6.8: Comparative distribution of the empirical cumulative regret of MARAB and MARABOUT on 1,000 problem instances (independently sorted for each algorithm) for time horizons T = 2,000 and T = 4,000.

[Figure 6.9 about here: average sorted rewards, comparing MARABOUT (C = 10^{-7}, α ∈ {0.1%, 1%, 20%}) and MARAB (C = 10^{-6}, α = 20%), two panels.]

Figure 6.9: Comparative risk avoidance of MARAB and MARABOUT for time horizon T = 2,000 on two artificial problems with low (left column) and high (right column) variance of the optimal arm.

[Figure 6.10 about here: left, average sorted rewards; right, empirical cumulative regret vs. iterations; both comparing MARABOUT (C = 10^{-7} and C = 10^{2}, α ∈ {0.1%, 10%}) and MARAB (C = 10^{-6}, α = 10%).]

Figure 6.10: Comparative performance of MARAB and MARABOUT on a real-world energy management problem. Left: sorted instantaneous rewards (truncated to the 37.5% worst cases for readability). Right: empirical cumulative regret with time horizon T = 100K, averaged over 40 runs.