
states (Reyes et al., 2006), which enables us to make generalisations such as “a turn is easier to take at lower speed” (Epshteyn and DeJong, 2006) or “a translation introduces noise into the text”.

Of course, how the agent learns might depend on the user and their training methods. For example, Knox and Stone (2015) focus on training an agent based on the polarity of the trainer (“encouraging”, “punishing”, etc.).

Humans are naturally biased towards giving positive rewards, which favours myopic learning (where the agent prefers immediate gain over trying for larger, long-term rewards), but by converting episodic tasks into continuing ones, the agent can successfully learn non-myopically. They differentiate between the task objective, which is the goal as seen by the user, and the learning objective, which might be “to find a behavioural policy that maximises the expectation of the sum of the future human reward”. They try to find the learning objective which allows the agent to perform well with respect to the task objective; in other words, they try to balance the objectives such that the agent behaves as the trainer intended.

Knox and Stone (2015) also give an interesting insight into how to adjust the long- or short-sightedness of the agent with regard to the rewards it expects. Should it be patient, hoping for larger rewards later, or should it set out to get the maximum gain it can straight away? The parameter that controls this is known as the discount factor γ. They claim that a higher value of γ makes the agent more robust to environmental changes that block the MDP-optimal path whilst leaving the goals unchanged. In our case, such a change could be a run of documents that have no extractable events, for example, or that receive no user feedback. At the other extreme, setting γ to zero “reduces reinforcement learning of a value function to supervised learning of a reward function”. The disadvantage is that the reward function then represents the optimal policy rather than the task goals, and is therefore not necessarily attainable in real life. This forces the trainer to micro-manage, which is clearly unacceptable in our case.

Their findings suggest that we should set γ to be high, as we want the agent to be task-focused.
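To make the effect of γ concrete, here is a minimal sketch (an illustration written for this discussion, not code from Knox and Stone) that compares the discounted return of two hypothetical reward streams: with γ close to zero the agent prefers the small immediate reward, whereas with a high γ it prefers to wait for the larger, delayed one.

def discounted_return(rewards, gamma):
    # Sum of gamma^t * r_t over a finite horizon.
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Hypothetical reward streams: a small gain now versus a larger gain later.
greedy_path = [1.0, 0.0, 0.0, 0.0]
patient_path = [0.0, 0.0, 0.0, 2.0]

for gamma in (0.1, 0.9):
    g = discounted_return(greedy_path, gamma)
    p = discounted_return(patient_path, gamma)
    print(f"gamma={gamma}: greedy={g:.3f}, patient={p:.3f},",
          "prefers", "patient" if p > g else "greedy")

With γ = 0.1 the greedy stream wins, which corresponds to the myopic behaviour described above; with γ = 0.9 the patient stream wins, which is why a high γ keeps the agent focused on the task rather than on the next individual reward.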

Finally, in production, there will not be just one user of the system, but many, and we must take all of their feedback into account. Shelton (2000) discusses the difficulties involved in doing this: how do we ensure that all users use the same scale? What happens if users do not agree on the feedback to be given in a particular situation? In fact, our qualitative approach should help answer some of these questions, since there is no longer any need to normalise the scale used, and the definition of the high-level requirements should encourage a consensus to be reached.


Part I

Learning to improve an information extraction chain from intuitive feedback


Chapter 3

Introduction to Part I

This part of the manuscript represents the bulk of the work done during the three years of the thesis.

We first present the challenge that we faced in the context of the CIFRE contract, that of extracting information from open-source documents for the Open Source INTelligence (OSINT) community using the WebLab platform, and say a little more about the OSINT community itself.

We then give a rapid introduction to reinforcement learning (RL) and Markov Decision Processes (MDPs), in order to make our implementation choices clearer.

We go through the thought process of modelling a chain as an MDP, starting with a generic discussion of the information available, which should be applicable to any other chain. We then go into the specific choices that we made for our implementation.

We present BIMBO, the brains behind the operation. She is a fully configurable, modular platform, applicable to a variety of situations (we later apply her to image analysis in chapter 13, for example). She is responsible for translating the AI’s abstract choice of an action into a tangible one, and for turning the current state of the document and system into a state that the AI understands. She measures the quality of the results found, keeps track of the time spent on treatment, and outputs logs and result files so that the performance of the AI can be monitored.
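As a rough sketch of what such a glue layer involves (the class and method names below are hypothetical and are not BIMBO’s actual interface), the component maps an abstract action identifier onto a concrete processing module, builds the abstract state the agent observes, and records quality, time and logs along the way.

import json
import time

class ChainEnvironment:
    """Hypothetical glue between an abstract RL agent and concrete IE modules."""

    def __init__(self, modules, quality_fn, log_path="run.log"):
        self.modules = modules        # e.g. {0: extract_entities, 1: translate, ...}
        self.quality_fn = quality_fn  # scores the annotations produced so far
        self.log_path = log_path

    def observe(self, document):
        # Translate the raw document/system state into the state the agent sees.
        return (document.get("language"), len(document.get("annotations", [])))

    def step(self, document, action_id):
        # Turn the agent's abstract choice into a tangible treatment, then
        # measure the quality obtained and the time spent doing it.
        start = time.time()
        document = self.modules[action_id](document)
        elapsed = time.time() - start
        quality = self.quality_fn(document)
        with open(self.log_path, "a") as log:
            log.write(json.dumps({"action": action_id, "quality": quality,
                                  "seconds": elapsed}) + "\n")
        return self.observe(document), quality, elapsed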

We give a first suite of experiments which act as a proof of concept, showing that modelling an IE treatment chain as an MDP and then using a standard reinforcement learning technique to build the chain step by step is feasible.
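For reference, the textbook tabular Q-learning update looks like the sketch below (a generic reminder, not necessarily the exact variant used in these experiments): after taking action a in state s, receiving reward r and reaching state s', the estimate Q(s, a) is nudged towards r + γ max_a' Q(s', a').

from collections import defaultdict

Q = defaultdict(float)   # Q[(state, action)] -> estimated long-term value
alpha, gamma = 0.1, 0.9  # learning rate and discount factor

def q_update(s, a, r, s_next, available_actions):
    # One Q-learning step; available_actions are the actions possible in s_next.
    best_next = max((Q[(s_next, a2)] for a2 in available_actions), default=0.0)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def greedy_action(s, available_actions):
    # The action the agent currently rates best in state s.
    return max(available_actions, key=lambda a: Q[(s, a)])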

We then move away from the standard approach of rewarding the agent with quantitative, or numerical, feedback, and discuss how it could be rewarded with qualitative, or non-numerical, feedback.

Finally, we implement a new algorithm, SSB Q-learning (Gilbert et al., 2016), and carry out a variety of tests that show that qualitative (non-numeric) feedback can be given with excellent results.



Chapter 4

The industrial challenge

As we said in section 1.1, our research is driven by the Open Source INTelligence (OSINT or OSCINT) community. The aim of OSINT is to extract structured information from unstructured open-source data. Steele (1995) defines OSINT as “[...] intelligence derived from public information – tailored intelligence which is based on information which can be obtained legally and ethically from public sources.”

This information contributes to the knowledge of the OSINT analyst, and enables a full appreciation of the situation, allowing informed decision-making and actions. These decisions and actions depend on the domain’s objectives, whether it is to seek an industrial advantage, to gather technological intelligence, or to enhance an online reputation. Even though the objectives may vary, every OSINT domain faces the same two main challenges (WebLab, 2016a):

Unstructured and complex data. A huge amount of open-source data is readily available, especially on the Web, and in most cases the content is very rich. However, gathering usable information from it is a complex and tedious task. The Web is dynamic, constantly growing and changing, the sources are multi-lingual, and they are presented in many different media formats. Finally, there is often little or no cross-referencing or validation of the data.

Choice of tools. Not surprisingly, with the growth in data, many tools have been developed to try to extract the information. First, the best tool for each function must be chosen, whether it is the transcription of a video or the translation of a document. Then these tools must be made to work together so that, for example, the video can be collected and transcribed, the transcription obtained can be translated, and the relevant information extracted from the translation. Like the Web, the tools are not static. The system must therefore be flexible, allowing treatments to be adapted, sources to be changed, and different information extraction requirements to be satisfied. Many data analysis systems therefore rely on a distributed architecture, allowing a modular treatment of the multimedia (for example, Ogrodniczuk and Przepiórkowski (2010) give an overview of some processing chains that are provided as web services).
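As a toy illustration of this kind of chaining (the function names below are hypothetical and do not correspond to WebLab’s actual API), one possible chain for a video source collects the media, transcribes it, translates the transcription, and extracts information from the translation; because each tool simply enriches a shared resource, tools can be swapped, reordered or replaced as sources and requirements change.

def collect(resource):
    # Fetch the raw media from the (hypothetical) source URL.
    resource["media"] = f"fetched:{resource['url']}"
    return resource

def transcribe(resource):
    resource["text"] = f"transcript({resource['media']})"
    return resource

def translate(resource):
    resource["text_en"] = f"translation({resource['text']})"
    return resource

def extract(resource):
    resource["events"] = [f"events-from({resource['text_en']})"]
    return resource

def run_chain(resource, tools):
    # Apply each tool in turn; changing the chain only means editing this list.
    for tool in tools:
        resource = tool(resource)
    return resource

result = run_chain({"url": "http://example.org/video"},
                   [collect, transcribe, translate, extract])
print(result["events"])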