
Traditionally, reinforcement learning algorithms use numerical rewards. This has the advantage of being precise, allowing the agent to learn exactly which policies are the best, and indeed, as we showed in chapter 8, our approach works very well with this type of user feedback. The trouble is that most humans are not comfortable giving precise numeric values. Imagine that you greet your neighbour with “How are you this morning?”. He is highly unlikely to reply “I’m 7.5 today, 3.2 better than yesterday”. He will probably respond “I’m fine thanks, much better than yesterday”.

In the same way, when we ask the analyst to define his requirements, it is highly improbable that he can specify precise numerical values for a similarity measure, such as “an edit on a character costs 0.3”, “a character deletion costs 0.6”, “a missing first name of length 7 costs 10.5”, etc., or for the treatment time, such as “15.2 seconds costs 7.9”, and so on. Not only is it nigh on impossible to be exhaustive in the list of possible outcomes, but it is also extremely difficult to evaluate each possibility quantitatively.

Our objective was therefore to find a way of learning from qualitative feedback, which the analyst could define easily and intuitively.

Weng et al. (2013) point out that the definition of a numerical reward function is non-intuitive, especially when this reward does not represent a physical measure. They treat this problem by offering the agent an ordinal reward from a categorical, completely ordered scale. Weng and Zanuttini (2013) treat a similar problem of non-numeric rewards which can be ordered by a tutor. Both reformulate the Ordinal Reward MDP (ORMDP) (Weng, 2011) as a Vector Reward MDP (VMDP). Similarly, in Gilbert et al. (2015a), the user is asked to provide comparisons, expressed as value vectors, rather than give a specific numerical reward.

To illustrate this idea of value vectors:

As a reminder, a policy π is a set of pairs {(s, a)}, i.e. in a given state s, a is the (best) action to perform. An MDP models the probability (as observed by the agent), for each state and each action from that state, of arriving in a state s′ and getting reward r.

Instead of using numerical values for the rewards, they are given a “label”, such as wi (for example, “an extraction in 10 seconds”, or “no extraction”), which is associated automatically with a final state.

Following π then gives a vector whose components are the probabilities of receiving each of the rewards (w1, w2, w3, w4), e.g.

(p1, p2, p3, p4),

meaning that the agent receives the first reward w1 with probability p1, reward w2 with probability p2, and so on.
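To make this concrete, the following is a minimal sketch in Python (the states, transition probabilities and reward labels below are invented purely for this illustration and are not part of our system). It computes, for a fixed policy, the probability of ending in each labelled final state, i.e. the vector (p1, p2, p3, p4):

# Sketch: compute the vector (p1, ..., pk) of probabilities of receiving each
# reward label (w1, ..., wk) when following a fixed policy from a start state.
# The tiny MDP below is invented purely for illustration.

from collections import defaultdict

labels = ["w1", "w2", "w3", "w4"]            # ordinal reward labels
terminal = {"t1": "w1", "t2": "w2", "t3": "w3", "t4": "w4"}   # labelled final states

policy = {"s0": "a", "s1": "b"}              # state -> (best) action
transition = {                               # (state, action) -> [(next_state, prob)]
    ("s0", "a"): [("s1", 0.5), ("t1", 0.5)],
    ("s1", "b"): [("t2", 0.7), ("t4", 0.3)],
}

def label_distribution(state, prob=1.0, acc=None):
    """Accumulate the probability of reaching each labelled final state under the policy."""
    if acc is None:
        acc = defaultdict(float)
    if state in terminal:
        acc[terminal[state]] += prob
        return acc
    for next_state, p in transition[(state, policy[state])]:
        label_distribution(next_state, prob * p, acc)
    return acc

dist = label_distribution("s0")
print([round(dist[w], 3) for w in labels])   # -> [0.5, 0.35, 0.0, 0.15]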

For Weng and Zanuttini (2013), these strictly ordinal rewards, whilst more natural than numerical user feedback, can lead to non-intuitive questions for the user: do you prefer 2w1 + 7w3 + 3w4 to 3w1 + w2 + 2w5? Such comparisons can be complicated and non-intuitive to evaluate. For instance, could the analyst judge whether it is better to have to change one letter of a name, delete a place, add a year and have an extraction two seconds faster, than to have to add a name and change the month, but have an extraction one second slower? Using the technique of Weng and Zanuttini (2013) would also mean potentially asking the user questions during the treatment of the document (at the calculation of the policy), which does not fit in with their normal work-flow.
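Purely to illustrate why such comparisons are awkward (this is not the criterion actually used by Weng and Zanuttini (2013)), one could represent each formal sum as a vector of counts over the ordered labels and test for cumulative-count dominance; whenever neither side dominates, as in the example just given, the question has to be put back to the user:

# Sketch: compare two formal sums of ordinal rewards, e.g. 2w1 + 7w3 + 3w4
# versus 3w1 + w2 + 2w5, via cumulative-count dominance over the ordered labels.
# This is only an illustration of the difficulty, not the authors' actual method.

from itertools import accumulate

LABELS = ["w1", "w2", "w3", "w4", "w5"]       # assumed order: w1 best ... w5 worst

def dominates(a, b):
    """True if 'a' has at least as many of the k best labels as 'b', for every k."""
    cum_a, cum_b = list(accumulate(a)), list(accumulate(b))
    return all(x >= y for x, y in zip(cum_a, cum_b))

sum_a = [2, 0, 7, 3, 0]                       # counts per label: 2w1 + 7w3 + 3w4
sum_b = [3, 1, 0, 0, 2]                       # counts per label: 3w1 + w2 + 2w5

if dominates(sum_a, sum_b):
    print("the first sum is at least as good")
elif dominates(sum_b, sum_a):
    print("the second sum is at least as good")
else:
    print("incomparable: the user would have to be asked")   # this example's outcome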

We also considered using the more relaxed landmarks of qualitative reasoning (see Travé-Massuyès et al. (2003) for an introduction). This gives us the comparisons +, −, 0, ? (better than, worse than, similar to, or incomparable with the previous). Comparing “nothing extracted” with “a perfect extraction”, we could assume −, for instance, or if something was extracted, we would rely on the user to tell us if it were + or 0. This rather Orwellian approach of “double plus good” (Orwell, 1950) still requires the user to make a non-intuitive cognitive judgement (where would they draw the line between + and 0, for example), and to be consistent in those judgements.
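For completeness, a minimal sketch of how such qualitative feedback could be recorded is given below (the enumeration and the example judgement are ours, purely for illustration; the judgement itself would still have to come from the user):

# Sketch: the qualitative feedback values of the "landmark" approach discussed
# above, recorded as an enumeration.  Purely illustrative.

from enum import Enum

class Feedback(Enum):
    BETTER = "+"          # better than the previous extraction
    WORSE = "-"           # worse than the previous extraction
    SIMILAR = "0"         # similar to the previous extraction
    INCOMPARABLE = "?"    # cannot be compared with the previous extraction

# The analyst's judgement for a new extraction, relative to the previous one:
feedback = Feedback.SIMILAR   # drawing the line between "+" and "0" is the hard part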

Humans are not very good at giving precise numerical values, or evaluating complex vectors, but they are usually excellent at making simple pairwise comparisons, so we turned to preference-based reinforcement learning (PBRL) (Busa-Fekete et al., 2014; Akrour et al., 2012; Fürnkranz et al., 2012; Wilson et al., 2012; Wirth and Fürnkranz, 2013a,b; Wirth et al., 2016). This is the integration of two sub-fields of machine learning, namely preference learning and reinforcement learning.

For example, in Fürnkranz et al. (2012), the user is asked for their preferences over simulated roll-outs, or trajectories. In Figure 9.1, we see that from a given common state s1, the agent simulates the trajectories formed by taking each possible action from that state (the “roll-outs”), and then following a given policy until the final states. Maximizing the expected cumulative reward cannot be done directly, as the numerical values of those rewards are not available, but the outcomes can be given a preference order, for instance that τ1 is preferred to τ2, and that τ2 is preferred to τ3. Knowing this preference order, we can then infer a preference order over the actions from the common state s1: that a1 is preferred to b1, which is preferred to c1. This means that we can infer that Q̂(s1, a1) is greater than Q̂(s1, b1), which is in turn greater than Q̂(s1, c1).

Figure 9.1 – Roll-outs are carried out from s1, giving a preference order τ1, τ2, τ3 over the trajectories, and hence over the actions from s1.
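The following sketch shows how such preferences could be turned into an ordering over the actions taken at the common state (the trajectories, the preference list and the ranking step are all invented for this illustration, and are not Fürnkranz et al.’s implementation):

# Sketch: from user preferences over roll-outs started at a common state s1,
# infer an ordering over the actions tried at s1.  Everything here is invented.

# Each roll-out records the first action taken from s1 and the resulting trajectory.
rollouts = {
    "a1": "tau1",
    "b1": "tau2",
    "c1": "tau3",
}

# The user's preference order over the trajectories, best first.
user_preference = ["tau1", "tau2", "tau3"]

# Rank each action by the position of its trajectory in the preference order.
rank = {trajectory: position for position, trajectory in enumerate(user_preference)}
ordered_actions = sorted(rollouts, key=lambda action: rank[rollouts[action]])

# We can now infer Q_hat(s1, a1) > Q_hat(s1, b1) > Q_hat(s1, c1) without ever
# having assigned a numerical reward to any individual outcome.
print(ordered_actions)   # -> ['a1', 'b1', 'c1']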

In our case, as we saw in chapter 6, we define our states as vectors of values, one of which is the detected language. We could therefore imagine that in Figure 9.1, s1 could be a state where the language detected is Afrikaans.

The action a1 could be a direct extraction from the original Afrikaans, resulting in final state sn. However, Matthew (2015) states that the extraction of named entities from Afrikaans has been neglected in favour of languages such as Dutch. Our South African friends tell us that Afrikaans is similar to Dutch, so we offer an action b1, which is to translate from Afrikaans to Dutch and then to use the Dutch extraction rules, finishing in state s′n. We also know that our richest named entity extraction rules are for the English language, and so we try a third action c1: to translate the Afrikaans into English and then extract, presenting the results in state s′′n. We then ask the user to compare the three results, and from their preferences we can infer which action is the best in state s1.
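As a sketch of what these three roll-outs could look like in code (the translate() and extract() functions below are stand-in stubs, not our actual translation and extraction components):

# Sketch of the three candidate roll-outs for a document whose detected
# language is Afrikaans.  translate() and extract() are stand-in stubs.

def translate(text, source, target):
    """Stub standing in for a real machine-translation step."""
    return f"[{source}->{target}] " + text

def extract(text, rules):
    """Stub standing in for named-entity extraction with a given rule set."""
    return {"rules": rules, "entities": [], "text": text}

def rollout_a1(document):
    """a1: extract directly from the original Afrikaans."""
    return extract(document, rules="afrikaans")

def rollout_b1(document):
    """b1: translate Afrikaans to Dutch, then apply the Dutch extraction rules."""
    return extract(translate(document, "af", "nl"), rules="dutch")

def rollout_c1(document):
    """c1: translate Afrikaans to English, then apply the English extraction rules."""
    return extract(translate(document, "af", "en"), rules="english")

# The three results are presented to the analyst; their pairwise preferences
# induce an ordering over the actions a1, b1 and c1 for the "Afrikaans" state.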

This is expensive in terms of processing and user time, however, as the same document would be treated three times, and the comparison between the results would still not necessarily be intuitive, nor would it enter into the analyst’s normal daily routine. Also, in attempting to construct a consistent reward function from the preference orders, we risk introducing preferential information which was not given explicitly by the user, therefore giving the agent an unintended bias. For example, the user may prefer τ1 to τ2 until he finds out that τ1 was produced by sheer luck, and that normally the “a1” path would produce τ4, which is the worst possible result.