Techniques for batch reinforcement learning in robotics


Universidade de Aveiro, Departamento de Electrónica, Telecomunicações e Informática, 2015

João Alexandre da Silva Costa e Cunha

Técnicas para a Aprendizagem por Reforço em Lote na Robótica

Techniques for Batch Reinforcement Learning in Robotics


Thesis presented to the Universidade de Aveiro in fulfilment of the requirements for the degree of Doctor in Informatics Engineering (Doutor em Engenharia Informática), carried out under the scientific supervision of Prof. Doutor José Nuno Panelas Nunes Lau and Prof. Doutor António José Ribeiro Neves, assistant professors at the Departamento de Electrónica, Telecomunicações e Informática of the Universidade de Aveiro


the jury

president Doutor Joaquim Manuel Vieira

Full Professor at the Universidade de Aveiro

Doutor Luís Paulo Gonçalves dos Reis

Associate Professor at the Universidade do Minho

Doutor Rui Alexandre de Matos Araújo

Assistant Professor at the Faculdade de Ciências e Tecnologia of the Universidade de Coimbra

Doutor Francisco António Chaves de Saraiva de Melo

Assistant Professor at the Instituto Superior Técnico of the Universidade de Lisboa

Doutor Artur José Carneiro Pereira

Assistant Professor at the Universidade de Aveiro

supervisor Doutor José Nuno Panelas Nunes Lau


Acknowledgements I thank Professor Doutor Nuno Lau for accepting the challenge of supervising me as a doctoral student. Thank you for the support and for the trust placed in me throughout this journey.

I thank the CAMBADA team for the spirit of camaraderie, especially during the competitions. Special appreciation goes to Gustavo Corrente, Ricardo Sequeira, Frederico Santos, Daniel Martins, Nuno Figueiredo and Nelson Filipe, some of the former members with whom I had the pleasure of losing hours of sleep preparing for the competitions. To the members who remain, I wish every success and hope they can carry this effort forward.

Particular thanks go to Eurico Pedrosa and Rui Ferreira for their tireless help in the LUL/CAMBADA@Home and PRODUTECH projects. The results obtained in these projects are due, in large part, to them. Without the scientific discussions and other coffee-break conversations, this stage would not have been the same.

Thanks also to Professor Doutor Martin Riedmiller and Manuel Blum of the University of Freiburg for their availability in clarifying questions about reinforcement learning and Gaussian processes.

Thanks to the members of the IRIS Lab, Aneesh Chauhan, Nima Shafii, Abbas Abdolmaleki and Miguel Oliveira, for sharing their diverse knowledge with me.

I thank the Universidade de Aveiro, and IEETA in particular, for the logistical and financial support that made this doctorate possible.

I thank my immediate family, my parents, Estefânia and Manuel, and my brother Filipe, for their support during the doctorate.

To Pac, there are no words to express my gratitude. Your support was endless and tireless. Thank you for always being by my side when I needed it most.

I thank everyone who supported me, directly or indirectly.


Keywords (Palavras-chave) Reinforcement Learning, Batch Reinforcement Learning, Robotics, Gaussian Processes, Models, Machine Learning

Summary (Resumo) This thesis addresses the application of Batch Reinforcement Learning methods in Robotics. As the name indicates, Batch Reinforcement Learning methods learn to complete a task by processing a batch of interactions with the environment. Three contributions are proposed that aim to make learning faster and more stable.

The Q-learning rule is widely used because it allows learning without a model of the environment. However, it is based on a single transition and does not take advantage of the episode-based structure of the batch of experiences. This work proposes the Q-Batch rule, which processes the experiences along the trajectories described during the interaction. In this way, the value of the rewards and penalties obtained can be propagated more quickly, allowing faster and more robust learning.

The application of non-parametric approximators such as Gaussian Processes is also explored. This type of approximator allows prior knowledge about the characteristics of the function being approximated to be encoded in the form of kernels, providing greater flexibility and accuracy. The application of Gaussian Processes in Batch Reinforcement Learning achieved higher performance in learning behaviours than other approximators found in the literature.

Lastly, in order to extract more information from the experiences acquired by the agent, techniques for learning transition models are incorporated. In this way, the set of experiences acquired through interaction with the environment can be augmented with experiences generated through planning with the transition models.

Experiments were carried out mainly in simulation, with some tests carried out on a physical robotic platform. The results obtained show that the proposed approaches are able to outperform the classical Fitted Q Iteration.


Keywords Reinforcement Learning, Batch Reinforcement Learning, Robotics, Gaussian Processes, Models, Machine Learning

Abstract This thesis addresses Batch Reinforcement Learning methods in Robotics. This sub-class of Reinforcement Learning has shown promising results and has been the focus of recent research. Three contributions are proposed that aim to extend the state-of-the-art methods, allowing for a faster and more stable learning process, as required for learning in Robotics.

The Q-learning update-rule is widely applied, since it allows learning without a model of the environment. However, this update-rule is transition-based and does not take advantage of the underlying episodic structure of the collected batch of interactions. The Q-Batch update-rule is proposed in this thesis to process experiences along the trajectories collected in the interaction phase. This allows a faster propagation of the obtained rewards and penalties, resulting in faster and more robust learning.

Non-parametric function approximators, such as Gaussian Processes, are explored. This type of approximator allows prior knowledge about the latent function to be encoded in the form of kernels, providing a higher level of flexibility and accuracy. The application of Gaussian Processes in Batch Reinforcement Learning achieved higher performance in learning tasks than other function approximators used in the literature.

Lastly, in order to extract more information from the experiences collected by the agent, model-learning techniques are incorporated to learn the system dynamics. In this way, it is possible to augment the set of collected experiences with experiences generated through planning using the learned models.

Experiments were carried out mainly in simulation, with some tests carried out on a physical robotic platform. The obtained results show that the proposed approaches are able to outperform the classical Fitted Q Iteration.


List of Figures

2.1 The different Machine Learning paradigms
2.2 A Reinforcement Learning system
2.3 The dimensions of Reinforcement Learning
2.4 A maze environment
2.5 The optimal state-value function and action-value function for the maze environment
2.6 The optimal policy of the maze environment
2.7 Generalized Policy Iteration
2.8 Bellman Optimality criterion backup diagram
2.9 Monte Carlo backup diagram
2.10 Q-learning backup diagram
2.11 Optimism in face of uncertainty
2.12 Main classes of Reinforcement Learning algorithms
2.13 The Fixed and Growing Batch Reinforcement Learning Problems
2.14 The Fitted Iteration framework
2.15 A tree-based value function using a single decision tree
2.16 A Neural Q Function scheme
2.17 A sigmoid function profile
2.18 Dynamic Scaling Heuristic
2.19 The actor-critic architecture of NFQCA
2.20 An autoencoder neural network
3.1 Watkins-Q(λ) backup diagram
3.2 Q-Batch backup diagram
3.3 Example of a smooth cost function
3.5 The mountain car learning task
3.6 The best policies found in each experiment while learning on the deterministic mountain car environment
3.7 The best policies found in each experiment while learning on the stochastic mountain car environment
3.8 The inverted pendulum environment
3.9 The inverted pendulum learning task
3.10 The best policies found in each experiment while learning on the deterministic inverted pendulum environment
3.11 The best policies found in each experiment while learning on the stochastic inverted pendulum environment
3.12 A CAMBADA Middle Size League robot
3.13 Action set specified for the rotating learning task
3.14 Performance comparison between the learned and the hand-coded controllers in simulation on the rotating learning task
3.15 Performance comparison between the learned and the hand-coded controllers in a real environment on the rotating learning task
3.16 Learning performance over time of the two update-rules used on the rotating learning task
3.17 Action set specified for the dribbling learning task
3.18 Comparison between the learned and hand-coded controllers in simulation on the dribbling task
3.19 Comparison between learning and hand-coded controllers in a real environment on the dribbling learning task
3.20 Visual representation of the pass learning task
3.21 Comparison of the hand-coded and the learned behavior when receiving a pass in simulation
3.22 Comparison of the hand-coded and the learned behavior when receiving a pass in the real platform
4.1 Gaussian Process toy example
4.2 Local Gaussian Process approach
4.3 Performance of the best policy found for the full GP hyperparameter grid
4.4 Performance of the best policy found for the full GP hyperparameter grid search in stochastic mountain car environment
4.5 Performance of the best policy found for the full GP hyperparameter grid search in deterministic inverted pendulum environment
4.6 Performance of the best policy found for the full GP hyperparameter grid search in stochastic inverted pendulum environment
5.1 The proposed model learning fitted iteration framework
5.2 Graphical models of different transition models. The darker nodes represent the latent variables.
A.1 The data model
A.2 The fitted iteration model
A.3 The agent model
A.4 The BRLL architecture


List of Tables

2.1 Comparison of different development methodologies for intelligent robots in terms of domain knowledge requirements and learning complexity
3.1 The standard structure of collected interaction set F
3.2 A time consistent data set F with an episode-based structure
3.3 Experimental results of mountain car on a deterministic environment
3.4 Experimental results of mountain car on a non-deterministic environment
3.5 Experimental results for the inverted pendulum on a deterministic environment
3.6 Experimental results for the inverted pendulum on a non-deterministic environment
4.1 Analysis of the ten best performing hyperparameters in the deterministic
4.2 Experimental results regarding the application of Local GPs in the deterministic mountain car environment
4.3 Analysis of the ten best performing hyperparameters in the stochastic mountain car environment
4.4 Experimental results regarding the application of Local GPs in the stochastic
4.5 Analysis of the ten best performing hyperparameters in the deterministic
4.6 Experimental results regarding the application of Local GPs in the deterministic inverted pendulum environment
4.7 Analysis of the ten best performing hyperparameters in the stochastic
4.8 Experimental results regarding the application of Local GPs in the stochastic
5.1 The structure of the interaction data set F
5.2 Experimental results of mountain car on a deterministic environment
5.3 Experimental results of mountain car on a non-deterministic environment
5.4 Experimental results of inverted pendulum on a deterministic environment
5.5 Experimental results of inverted pendulum on a non-deterministic


List of Algorithms

1 Policy Iteration
2 Value Iteration
3 Every-visit Monte Carlo method
4 TD(0) algorithm
5 The Q-Learning algorithm
6 Generic Fitted Iteration
7 Q-Batch pattern generation algorithm
8 Local Gaussian Process Recursive Clustering
9 Dyna-Q
10 Model Learning and Planning Fitted Q Iteration
11 Naive exploration algorithm


Chapter 1 Introduction

In recent decades, the field of Robotics has experienced tremendous growth. Robots are no longer confined to factory floors, but are steadily being introduced into more and more aspects of our daily lives. Examples range from commercially available vacuum cleaners (Jones 2006, Rooks 2001) to more experimental robots applied in domains such as service robotics (Bohren et al. 2011, Reiser et al. 2009), military (Sharkey 2007, Yamauchi 2004), precision agriculture (English et al. 2014, Johnson et al. 2009), power line inspection (Martinez et al. 2014, Mills et al. 2010, Sa and Corke 2014) and autonomous driving (Thrun et al. 2006, Urmson et al. 2008), among others. As robots begin to tackle increasingly unstructured environments, the complexity of robot behaviors increases. Classical development approaches, which involve programming the robot logic by hand, fail to scale with this increase in complexity. A promising alternative is to allow a robot to learn from its past actions.

Reinforcement Learning provides a valuable alternative for developing intelligent robots by allowing the robot to interact with the environment while teaching it through penalties and rewards (Kormushev et al. 2013). This greatly reduces the domain knowledge needed for a robot to perform a given task. Among the existing Reinforcement Learning approaches, the Batch Reinforcement Learning framework has been widely applied in Robotics due to its increased data efficiency and its ability to learn in continuous state and action spaces. Despite this being an active research field, the robots of today are still far from being able to learn autonomously. This thesis explores several approaches in order to move closer to the goal of autonomous learning. The main goal of this thesis is to contribute to the state of the art of Batch Reinforcement Learning through the combination of batch-oriented update-rules, novel function approximators, and model learning and planning approaches.

1.1 Motivation

Reinforcement Learning has the potential to help advance the field of Robotics. However, classical Reinforcement Learning methods were not suitable for application in real robots, since they needed to collect a large number of interactions with the environment before good policies were obtained. Take, for example, the famous backgammon agent from Tesauro (Tesauro 1992, 1994, 1995), which was able to learn to play at top level after millions of games. In recent years, more sample-efficient methods have been proposed. Two diverging approaches have emerged, Batch Reinforcement Learning and Policy Search, both of which have been applied to real robots (Kober et al. 2013).

Policy Search is more closely related to the field of Stochastic Optimization. Policy Search does not calculate a value function, but instead incrementally searches for better policies in the space of representable policies. A key issue in Policy Search is therefore policy parameterization (Kormushev et al. 2012). Existing methods perform local updates to existing policies in order to maximize the occurrence of rewards, based on Monte Carlo evaluations. Proposed methods differ in the policy update procedure, such as gradient-based (Ng et al. 2004, Peters and Schaal 2006, 2008) and gradient-free optimization (Bagnell and Schneider 2001, Rubinstein and Kroese 2004), population-based (Kormushev and Caldwell 2012, Sun et al. 2009) or expectation maximization (Kober and Peters 2011, Vlassis et al. 2009). Additionally, to avoid falling into local minima, the optimization can be constrained, for example by keeping the entropy of the distribution of the policy parameters above a given threshold (Peters et al. 2010). At the core of the optimization process there is usually a model of the system dynamics that is used to evaluate the performance of the policy without interacting with the real system (Deisenroth and Rasmussen 2011). This can greatly reduce the interaction needed by an agent to learn how to perform a task.
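As a rough illustration of the local, gradient-free search in policy-parameter space described above, the sketch below hill-climbs a single policy parameter. Everything in it is an assumption made for illustration: the toy system `s' = s + a`, the linear policy `a = -theta * s`, the quadratic penalty and the names `rollout_return` and `hill_climb` do not come from the cited methods, and a single deterministic rollout stands in for a Monte Carlo evaluation.

```python
import random


def rollout_return(theta, s0=1.0, horizon=10):
    """Return of the linear policy a = -theta * s on the toy system s' = s + a."""
    s, total = s0, 0.0
    for _ in range(horizon):
        s = s + (-theta * s)          # apply the action chosen by the policy
        total += -(s * s)             # reward penalises distance from the origin
    return total


def hill_climb(theta=0.0, step=0.1, iterations=200, seed=0):
    """Gradient-free local search directly over the policy parameter."""
    rng = random.Random(seed)
    best = rollout_return(theta)
    for _ in range(iterations):
        candidate = theta + rng.uniform(-step, step)   # local perturbation
        value = rollout_return(candidate)
        if value > best:                               # keep only improvements
            theta, best = candidate, value
    return theta, best


theta, best = hill_climb()
```

No value function is estimated at any point; the policy is improved only by comparing the returns of nearby parameter settings.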

Batch Reinforcement Learning is a paradigm that estimates a value function, from which a policy is derived, based on a set of interactions (Lange et al. 2012). The value function is usually represented by a function approximator that is trained using batch supervised learning, providing a more robust approximation. It provides generalization to unvisited states as well as in continuous state spaces (Gordon 1995). By working with an existing set of interactions, Batch Reinforcement Learning presents a powerful advantage, since it is one of the few classes of methods applicable in fields where past interaction data is easy to obtain but permission to interact directly with and control the system is very hard to obtain, such as Medicine and Health Care (Ernst et al. 2006b, Guez et al. 2008, Pineau et al. 2009).
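The idea of estimating a value function from a fixed set of interactions by repeated supervised fitting can be sketched as follows. This is a minimal sketch in the spirit of Fitted Q Iteration, not the implementation used in the thesis: the four-transition batch is hypothetical, and `fit_average` is a deliberately trivial stand-in for the function approximator (it memorises mean targets and does not generalize to unvisited states).

```python
from collections import defaultdict


def fit_average(inputs, targets):
    """Stand-in regressor: stores the mean target of every (state, action) input."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, y in zip(inputs, targets):
        sums[x] += y
        counts[x] += 1
    means = {x: sums[x] / counts[x] for x in sums}
    return lambda x: means.get(x, 0.0)


def fitted_q_iteration(batch, actions, gamma=0.9, iterations=20):
    """batch holds (s, a, r, s_next, terminal) tuples collected beforehand."""
    q = lambda _x: 0.0                                  # initial Q-function
    for _ in range(iterations):
        inputs = [(s, a) for s, a, _, _, _ in batch]
        targets = [r if terminal else r + gamma * max(q((s2, b)) for b in actions)
                   for _, _, r, s2, terminal in batch]
        q = fit_average(inputs, targets)                # batch supervised learning step
    return q


def greedy(q, state, actions):
    """Policy derived from the learned Q-function."""
    return max(actions, key=lambda a: q((state, a)))


ACTIONS = ("left", "right")
# Hypothetical batch: a three-state chain where state 2 is the rewarded goal.
batch = [(0, "right", 0.0, 1, False), (0, "left", 0.0, 0, False),
         (1, "right", 1.0, 2, True), (1, "left", 0.0, 0, False)]
q = fitted_q_iteration(batch, ACTIONS)
```

The agent never interacts with the environment inside `fitted_q_iteration`; only the stored batch is used, which is what makes the approach applicable when data is available but direct control is not.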

Despite these advantages, Batch Reinforcement Learning still has some drawbacks. As an example, if the interactions are sampled from the environment according to an optimal policy, there is no guarantee that the policy derived from the approximated value function can reproduce the sampling policy. A straightforward imitative learning scheme would provide better guarantees of reproducing the observed behaviour. This is related to the fact that policies are evaluated according to transition-based update-rules. Another open issue in Batch Reinforcement Learning is the stability of the learning process. Although the goal is to find a policy that maximizes the rewards over time, the performance of the learning agent, measured by the accumulated rewards, rarely increases monotonically during learning. It is not uncommon to find publications that, when presenting learning results, average the performance over the last episodes, which smooths out these losses in performance and makes the overall trend of the learning process easier to see. This non-monotonic behaviour means that there is a risk involved every time the policy is updated, since the performance of the learning agent can decrease. This poses a great challenge for the application of this kind of method in Robotics, since the fluctuation in performance can result in wear and tear on the robot hardware or in potential harm to people in human-robot interaction scenarios.

Batch Reinforcement Learning combines function approximators and Reinforcement Learning in a straightforward manner. While commonly applied function approximators work as black-box systems that require no domain knowledge, they also hinder interpretation and prevent the inclusion of any available prior knowledge. Recent advances in regression methods, such as modern kernel methods (Schölkopf and Smola 2002), may increase the performance of Batch Reinforcement Learning methods. However, this also poses a challenge, since kernel methods usually do not cope well with the large amounts of data typically present in Reinforcement Learning.

Modern function approximators have been successfully applied to approximate the transition model of the environment (Hester and Stone 2011, Nguyen-Tuong and Peters 2011b). This model encodes the effect of the agent's actions on the environment and, in combination with planning approaches, can be used to augment the available interactions. While this concept is far from new, the integration of planning and learning methods arises naturally in the context of Batch Reinforcement Learning.
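The sketch below illustrates, under strong simplifying assumptions, how a learned transition model can augment a set of interactions. A one-parameter least-squares fit replaces the modern function approximators mentioned above, exhaustive querying of a few hand-picked states replaces a planner, rewards are left out, and the toy dynamics and the names `fit_linear_model` and `augment` are hypothetical.

```python
def fit_linear_model(transitions):
    """Least-squares fit of s_next - s = w * a from (s, a, s_next) samples."""
    num = sum(a * (s_next - s) for s, a, s_next in transitions)
    den = sum(a * a for _, a, _ in transitions)
    w = num / den
    return lambda s, a: s + w * a


def augment(transitions, model, query_states, actions):
    """Add model-generated (s, a, s_next) samples to the real ones."""
    simulated = [(s, a, model(s, a)) for s in query_states for a in actions]
    return list(transitions) + simulated


# Three real transitions of a hypothetical system that moves 0.5 per unit action.
real = [(0.0, 1.0, 0.5), (0.5, 1.0, 1.0), (1.0, -1.0, 0.5)]
model = fit_linear_model(real)
augmented = augment(real, model, query_states=[2.0, 3.0], actions=[-1.0, 1.0])
```

The simulated samples cover states 2.0 and 3.0, which were never visited in the real data; how trustworthy they are depends entirely on how well the model generalizes.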

1.2 Contributions

The Batch Reinforcement Learning framework still presents many challenges and open issues. The main goal of this thesis is to advance the state of the art of Batch Reinforcement Learning by improving the learning performance of existing methods with regard to interaction time and learning stability. I will focus on Fitted Q Iteration, a model-free approach successfully applied in Robotics, as a starting point. Several changes were made to improve the original algorithm. The contributions of this thesis are:

1. Batch Oriented Update-Rule: Fitted Q Iteration can be seen as Q-Learning for Batch Reinforcement Learning. As such, and although the use of function approximators makes it more data efficient, it suffers from the same drawback as Q-Learning: it estimates the value of a policy in a transition-wise manner. As a result, the update has to be repeated in order to propagate the obtained rewards back to the initial states of the trajectory. This is counter-intuitive in a Batch Reinforcement Learning scenario, where the interaction with the environment is often performed in episodes. I propose a novel update-rule, named Q-Batch, oriented to Batch Reinforcement Learning, that estimates the value of a policy by evaluating the trajectories of the interactions throughout episodes while maintaining the properties of off-policy and bootstrapping evaluation. The proposed approach is able to propagate rewards over long trajectories faster than existing update-rules, resulting in better policies and shorter learning times.

2. Gaussian Process Fitted Q Iteration: A different solution explored in this thesis is the application of modern kernel-based function approximators in the Fitted Q Iteration framework. The example studied is the application of Gaussian Processes, a powerful non-parametric function approximator, to approximate the value-function. A partitioning method based on Local Gaussian Processes is presented to allow the application of Gaussian Process Fitted Q Iteration in learning tasks where the number of interactions with the environment grows to tens of thousands of samples. The application of Gaussian Processes in the Fitted Q Iteration framework showed that learning agents can learn in a more stable manner, minimizing the occurrences of policy degradation. The increased stability also allows for faster learning times.

3. Integrating Model Learning and Planning in Fitted Q Iteration: Fitted Q Iteration is a model-free learning method that, as the term implies, is able to learn a given task in the absence of a model of the environment. To learn, an agent interacts directly with the environment and infers a policy based on the accumulated rewards obtained from those interactions. The agent learns the value of its actions without ever knowing how they affect the environment. Using model learning approaches on the collected experiences, the agent can also learn the outcome of its actions. This model can then be used, in a simulation phase, to explore the environment and augment the collected real experiences. With this in mind, two additional steps are integrated in the Fitted Q Iteration framework: a model learning phase and a planning/exploration phase. To learn the model, Gaussian Processes are applied, not only for their accuracy but also to model the uncertainty of stochastic environments. To augment the experience set, sample-based planners are applied.

4. Evaluation: While the focus of this thesis is on the development of methods that will better allow the application of Batch Reinforcement Learning in real robots, the evaluation of the aforementioned contributions is centered on simulated environments. This allows for a controlled evaluation of the effects of the proposed ideas while avoiding the effects of robot-specific sub-systems such as vision, state estimation and control, among others. Special care was also taken to allow the rest of the Reinforcement Learning research community to reuse the evaluations carried out. Therefore, the results were obtained in common benchmarking environments, such as mountain car and inverted pendulum. Moreover, CLS², the Closed Loop Simulation System, was chosen. CLS² is an open-source simulator which provides these benchmarks off-the-shelf with the system dynamics commonly used in the Reinforcement Learning literature. While most tests are carried out in simulated environments, some limited evaluations are carried out on real platforms as a proof of concept.

5. Open-source Batch Reinforcement Learning library: To support the evaluation, the proposed methods were implemented in a library that decouples the learning task design from the adaptation methods used during the learning phase. In this way, one can focus on the state representation and the actions composing the action set, and easily interchange the learning methods to compare and benchmark learning performance. Conversely, this decoupling allows a developer to reuse the same learning methods across different environments. The library focuses on Batch Reinforcement Learning, implementing the Fitted Q Iteration framework and the contributions of this thesis. In this way, the library provides a starting point for researchers entering the Batch Reinforcement Learning field, along with easy access to more recent methods for experienced researchers. Special care was taken in defining meaningful abstractions that foster extensions and the implementation of future research from the Reinforcement Learning community.
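Returning to the first contribution, the drawback of transition-wise updates can be made concrete with a small sketch. It applies the standard one-step Q-learning update to a single recorded episode on a hypothetical chain with one action, and counts how many passes over the episode are needed before the start state receives any value. It illustrates the problem only and is not the Q-Batch rule; the chain, the reward placement and the function name `q_learning_sweeps` are assumptions.

```python
def q_learning_sweeps(n_states=6, gamma=0.9, alpha=1.0):
    """Count forward sweeps over one episode until the start state sees the reward."""
    # One recorded episode: always move right along a chain, reward 1 on the last step.
    episode = [(s, s + 1, 1.0 if s + 1 == n_states - 1 else 0.0)
               for s in range(n_states - 1)]
    q = [0.0] * n_states          # single action, so q[s] stands for Q(s, right)
    sweeps = 0
    while q[0] == 0.0:
        for s, s_next, r in episode:          # transitions in time order
            terminal = s_next == n_states - 1
            target = r if terminal else r + gamma * q[s_next]
            q[s] += alpha * (target - q[s])   # one-step, transition-wise update
        sweeps += 1
    return sweeps, q


sweeps, q = q_learning_sweeps()
```

With the default five-transition episode, each sweep moves the reward information back by exactly one state, so five sweeps are needed even though the whole trajectory was available from the start.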

1.3 Outline of the Thesis

The structure of the thesis reflects the different proposed contributions. The thesis is organized in six chapters and one appendix.

Chapter 1 introduces this thesis. The main motivations driving the developed work are discussed and the contributions of the thesis are presented.

Chapter 2 presents the main concepts of Reinforcement Learning, discussing Markov Decision Processes, value-functions and common update-rules, the corresponding concepts in Batch Reinforcement Learning and also recent developments in the field. This chapter provides the background knowledge for the rest of the thesis, and is recommended on a first reading.

Chapter 3 describes the Q-Batch update-rule. This novel update-rule combines the episodic evaluation of Monte Carlo methods with the bootstrapping property of Temporal-Difference methods to achieve improved performance. To integrate Q-Batch in the Batch Reinforcement Learning framework, we also propose a different structure for the batch of experiences. We present experimental results of the application of Q-Batch in non-linear control problems, showing that the proposed update-rule outperforms other existing alternatives.

Chapter 4 presents Gaussian Process Fitted Q Iteration which, as the name implies, approximates the Q-function using Gaussian Processes. General concepts regarding inference and optimization are presented along with current limitations. The chapter also presents a local approximation scheme, Local Gaussian Processes, to better cope with large amounts of data. The experimental evaluation assesses the applicability of Gaussian Processes in the Batch Reinforcement Learning framework for value-function approximation. Experimental results show that the characteristically high approximation accuracy of Gaussian Processes yields stable learning and fast learning times.


context of Batch Reinforcement Learning. The chapter starts by presenting background knowledge on model learning techniques using statistical methods and existing plan-based RL methods. Gaussian Processes are applied for their accuracy and ability to explicitly represent the uncertainty of the predictions.

Chapter 6 draws the main conclusions of this thesis and discusses open questions and directions for future work.

Given its potential future impact, Appendix A presents the developed Batch Reinforcement Learning library. The general architecture is introduced and some design choices are discussed. With a focus on extensibility, a detailed description of the process of extending the different components of the library is provided.


Chapter 2 Reinforcement Learning

This chapter presents the background knowledge on Reinforcement Learning needed to understand the work developed in this thesis. While the chapter provides a broad perspective, covering different concepts, it is not intended to provide a thorough analysis of the current state of the field. The Reinforcement Learning field has grown significantly in recent decades and has branched into multiple sub-fields, each with a community that addresses different problems, which in turn has resulted in a variety of different methods. For a complete overview, interested readers are referred to Wiering and van Otterlo (2012).

Machine Learning is the field of study that develops learning capabilities in computer programs (Alpaydin 2014). Instead of executing pre-programmed instructions, a machine learning algorithm is able to learn from experience or, more specifically, from data. Tom Mitchell proposed the following definition: “a program is said to learn from experience E with respect to some class of tasks T, and performance measure P, if its performance at tasks in T, measured by P, improves with experience E” (Mitchell 1997). The Machine Learning field can be divided into three main areas, as shown in Figure 2.1. Each area is characterized by a different degree of supervision. Supervised Learning processes data composed of a set of features labelled with target values, usually provided by a supervisor or a teacher (Bishop 2006). The goal is to develop methods that learn the correspondence between the input features and the target output. In Unsupervised Learning there is no labelled data, and the purpose is to detect hidden similarities and other patterns contained in the data (Ghahramani 2004, MacKay 2003). In Reinforcement Learning, supervision is provided in the form of rewards and punishments instead of explicit desired outputs, and the goal is to learn how to avoid punishments and obtain rewards (Sutton and Barto 1998, Szepesvári 2010).

[Diagram: Machine Learning, branching into Unsupervised Learning, Reinforcement Learning and Supervised Learning]

Figure 2.1: The different Machine Learning paradigms.

Reinforcement Learning (RL) draws inspiration from biology and psychology and tries to mimic how animals learn behaviour. It is motivated by Thorndike’s Law of Effect (Thorndike 1911), which states that “applying a reward immediately after the occurrence of a response increases its probability of reoccurring, while providing punishment after the response will decrease the probability”.

Figure 2.2 illustrates a Reinforcement Learning system. At time $t$, the learning agent observes state $s_t$ and chooses action $a_t$. The agent is then able to perceive the effect of the chosen action $a_t$ by observing the successor state $s_{t+1}$ and the reward $r_{t+1}$ of choosing $a_t$ in state $s_t$.


Learning is defined by searching for a policy π from the space of all possible policies that maximizes the sum of rewards over time. To attain this, the agent usually interacts with the environment in order to find which actions provide the largest cumulative reward.

It should be noted that the obtained reward $r_{t+1}$ only reflects the impact of choosing action $a_t$ in state $s_t$. In other words, the feedback provided by the reward at a given time instant $t$ does not, by itself, instruct the agent on which actions to take to accomplish the task being learned. Only by trying different actions is the agent able to determine which actions yield the best outcome in the long run. This is in sharp contrast to supervised paradigms, where the training data includes the desired learning outcome. The evaluative nature of the immediate reward also provides an advantage over the instructive feedback of supervised paradigms. In supervised paradigms the training data instructs the learning agent on what to do in state $s_t$, which requires very rich domain knowledge, and a problem arises if the instructed action is not the best action available. In other words, supervised paradigms are bound by the limitations of the supervisors. In contrast, a Reinforcement Learning agent is not told the best available action but is informed of how good (or bad) a given action is. The agent must then use this information alone to infer a policy that accumulates the largest reward over time. One can argue that an RL agent is in turn limited by the reward function. While this is true, as will be shown throughout this thesis, complex behaviors can emerge from simple reward functions.

Table 2.1 compares the domain knowledge requirements and learning complexity of different methodologies for developing intelligent agents. As we can observe, these two dimensions are inversely related. At one end of the spectrum are hard-coded systems, where a developer (often tediously) pre-programs the behavior of the robot, explicitly specifying the desired behavior for all situations that the robot might encounter. Since all behavior logic is pre-specified, no learning occurs. Learning from demonstration approaches require the supervisor to instruct the robot on the best actions in given states, allowing the robot to generalize the observed actions to unvisited states. Finally, in Reinforcement Learning approaches, as mentioned before, the robot interacts with the environment and receives rewards or penalties according to its behavior. The rewards are then used to adapt the behavior to obtain even higher rewards. From a development point of view, we shall see that it is relatively easy to design reward functions for Reinforcement Learning agents. Conversely, with so little information available, the learning complexity grows, since the agent not only needs to determine the best action to take in a given state, but also needs to generalize to unvisited states and actions.

Learning approach              Domain knowledge required    Learning complexity
Hard-coded System              High                         None
Learning from demonstration    Medium                       Medium/Low
Reinforcement Learning         Low                          High

Table 2.1: Comparison of different development methodologies for intelligent robots in terms of domain knowledge requirements and learning complexity.

An important aspect of RL is that, since the goal is to maximize the accumulated rewards over time, it has a planning dimension. An agent may sometimes choose an action that yields a small immediate reward because the outcome of that action leads to more desirable states in the future and, consequently, to higher rewards. RL methods must also cope with stochastic environments, where there is uncertainty associated with each transition. Finally, an RL agent learns from its interactions with the environment and adapts its behaviour. These three dimensions, Planning, Uncertainty and Learning, compose the Reinforcement Learning field, as represented in Figure 2.3.

Figure 2.3: The dimensions of Reinforcement Learning: Learning, Uncertainty and Planning.

2.1 Return and Value Functions

As stated before, the reward signal $r_t$ obtained by a Reinforcement Learning agent provides only a notion of how good (or bad) it is to apply a given action in a given state. Since it is not told the best action beforehand, the agent must observe the obtained rewards and choose the actions that maximize the sum of immediate rewards over time, or return, as defined in Equation 2.1:

$$R_t = r_{t+1} + r_{t+2} + r_{t+3} + \dots + r_T \tag{2.1}$$

where $T$ is the final time step. This definition is suitable for tasks where the agent ends a training run, also called an episode, in a terminal state. However, many tasks cannot be clearly decomposed into episodes. For these continuing tasks, Equation 2.1 is not suitable since, for $T = \infty$, the return could easily be infinite. Therefore, we need to introduce the concept of discounting. Instead of trying to maximize the return, an agent should maximize the discounted return, or discounted cumulative reward:

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \tag{2.2}$$

where $\gamma$, $0 \le \gamma \le 1$, is the discount rate. The discount rate progressively reduces the value of future rewards, working similarly to a planning horizon. As $\gamma$ tends to zero the agent becomes more shortsighted, since it tries to maximize only the immediate reward: $\lim_{\gamma \to 0} R_t = r_{t+1}$. In general this is a bad choice, since by choosing the action with the largest immediate reward the agent can move to less desirable states, losing the ability to plan ahead. Conversely, as $\gamma$ tends to 1 the agent becomes more farsighted. If we choose $\gamma < 1$ and the rewards are bounded, then Equation 2.2 is bounded, which means that the formulation of discounted returns can be applied to both continuing and episodic tasks.
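As a minimal illustrative sketch, the discounted return of Equation 2.2 can be computed for a finite list of observed rewards. The function name `discounted_return` and the reward values are assumptions made for this example only.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k], where rewards[0] plays the role of r_{t+1}."""
    ret = 0.0
    # Accumulate backwards using the recursion R_t = r_{t+1} + gamma * R_{t+1}.
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret


rewards = [1.0, 0.0, 2.0]                  # hypothetical rewards r_{t+1}, r_{t+2}, r_{t+3}
print(discounted_return(rewards, 1.0))     # undiscounted return (Equation 2.1)
print(discounted_return(rewards, 0.5))     # farther rewards count less
print(discounted_return(rewards, 0.0))     # only the immediate reward remains
```

With $\gamma = 0$ only the first reward survives, which matches the shortsighted limit discussed above.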

The objective of learning is to obtain a policy, π. A stochastic policy is a mapping from every state of the state set, s ∈ S, and every action available of the action set of each state, a ∈ A(s), to the probability of choosing action a in state s, π(s, a) → [0, 1]. Alternatively, a deterministic policy is a mapping from every state of the state set to an action available from the action set of each state, π(s) → a ∈ A(s).

To obtain a policy, some Reinforcement Learning algorithms estimate value functions. Two types of value functions are used in Reinforcement Learning: state-value functions and action-value functions.


A state-value function, as the name implies, estimates the value of a given state. In this context the value is the accumulated reward obtained by following a given policy, starting in the given state. Since the obtained return depends on the current policy $\pi$, value functions are defined with respect to a particular policy, $V^\pi(s)$. We formulate $V^\pi(s)$ in Equation 2.3.

$$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\Big\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s\Big\} \tag{2.3}$$

Here we consider the expected value of the discounted return to cope with the stochastic nature of the environment, which may not yield the same return every time. Given a value function $V^\pi$, the corresponding deterministic policy $\pi$ selects, in each state, the action that maximizes the expected return:

$$\pi(s) = \arg\max_a E\{R_t \mid s_t = s, a_t = a\} \tag{2.4}$$

Alternatively, a policy can be obtained by estimating “how good” it is for the agent to choose action $a$ in state $s$. The function that gives this estimate is called the action-value function and is denoted by $Q^\pi(s, a)$. The action-value function is given by:

$$Q^\pi(s, a) = E_\pi\{R_t \mid s_t = s, a_t = a\} \tag{2.5}$$

As in Equation 2.3, in Equation 2.5 the value of taking action $a$ in state $s$ is given by the expected cumulative reward, this time assuming that action $a$ is chosen at time $t$ and that policy $\pi$ is followed afterwards. Computing a policy from an action-value function is straightforward:

$$\pi(s) = \arg\max_a Q(s, a), \quad \forall s \in S \tag{2.6}$$

Given an action-value function it is possible to calculate the value of a given state under this greedy policy, since the two types of value functions are related by Equation 2.7.

$$V(s) = \max_a Q(s, a) \tag{2.7}$$
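Equations 2.6 and 2.7 can be illustrated with a small tabular sketch. The Q-table below, its state and action names, and the helper functions `greedy_policy` and `state_values` are hypothetical and serve only as an example.

```python
# Hypothetical action-value table: Q[state][action].
Q = {
    "s0": {"left": -3.0, "right": -1.0},
    "s1": {"left": -2.0, "right": -4.0},
}


def greedy_policy(Q):
    """Equation 2.6: pick, in every state, the action with the largest Q-value."""
    return {s: max(actions, key=actions.get) for s, actions in Q.items()}


def state_values(Q):
    """Equation 2.7: the value of a state is the largest Q-value in that state."""
    return {s: max(actions.values()) for s, actions in Q.items()}


print(greedy_policy(Q))
print(state_values(Q))
```

Note that no model of the environment is needed in either function, since all the required information is stored in the table itself.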

The final goal of any Reinforcement Learning algorithm is to improve its policy $\pi$ until the optimal policy is found. We denote an optimal policy by $\pi^*$ and, correspondingly, the optimal state-value function by $V^*$, which satisfies:

$$V^*(s) \ge V^\pi(s), \quad \forall s \in S, \forall \pi \tag{2.8}$$

Correspondingly, an optimal action-value function, $Q^*$, is found when:

$$Q^*(s, a) \ge Q^\pi(s, a), \quad \forall s \in S, \forall a \in A(s), \forall \pi \tag{2.9}$$

The concepts introduced above can be illustrated visually. Consider the maze environment represented in Figure 2.4. The goal is for an agent to complete the maze by moving from the start cell to the end cell. The agent can move in the four main directions: up, down, left and right. The environment is deterministic, and moving towards an occupied cell leaves the agent in the same cell.

Figure 2.4: A maze environment.

For every move made inside the maze, the agent receives a penalty of -1. We wish to estimate the policy that accumulates the largest undiscounted return. Since all actions generate a penalty, the optimal policy is the one that completes the maze in the minimum number of decisions. Any action that does not move the agent one step closer to the end accumulates more penalties. Figure 2.5 presents the two optimal value functions for the maze environment described above.

(a) The optimal state-value function.

(b) The optimal action-value function.

Figure 2.5: The optimal state-value function and action-value function for the maze environment.

Observing Figure 2.5a, representing the state-value function, we can conclude that the value of each state, according to the optimal value function, coincides with the negative of the number of actions, or decisions, necessary to complete the maze from that state. On the other hand, Figure 2.5b represents the optimal action-value function. We can observe that each state is sub-divided in four cells, corresponding to each possible action. We can observe that the actions that move the agent closer to the end have higher values. Following Equation 2.7, for each state the largest value in the action-value function is equal to the value of the same state in the state-value function. The actions that move the agent against an occupied cell accumulate one more penalty, since for one additional cycle the agent will remain in the same cell. Finally, actions that move the agent away from the goal accumulate two additional penalties, since the agent will require two more actions (the one chosen included) to move back to the same cell.
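The relationships described above can be checked on a much smaller maze. The sketch below uses a hypothetical 3x3 layout (not the maze of Figure 2.4), with `S` the start cell, `E` the end cell and `#` an occupied cell. Since every move costs -1 and the environment is deterministic, the optimal state values are obtained here with a breadth-first search from the end cell, which is a shortcut chosen for this example and not a method taken from the text.

```python
from collections import deque

MAZE = ["S.#",
        ".##",
        "..E"]          # hypothetical layout
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def free(cell):
    r, c = cell
    return 0 <= r < len(MAZE) and 0 <= c < len(MAZE[0]) and MAZE[r][c] != "#"


def step(cell, action):
    dr, dc = MOVES[action]
    nxt = (cell[0] + dr, cell[1] + dc)
    return nxt if free(nxt) else cell      # blocked moves leave the agent in place


def optimal_values():
    goal = next((r, c) for r, row in enumerate(MAZE)
                for c, ch in enumerate(row) if ch == "E")
    V = {goal: 0}
    queue = deque([goal])
    while queue:
        cell = queue.popleft()
        for a in MOVES:
            prev = step(cell, a)           # moves are reversible, so neighbours are predecessors
            if prev not in V:
                V[prev] = V[cell] - 1      # one more penalty than the cell it leads to
                queue.append(prev)
    return V


V = optimal_values()
# Q*(s, a) = -1 + V*(s'), where s' is the cell reached by taking a in s.
Q = {(cell, a): -1 + V[step(cell, a)] for cell in V if V[cell] != 0 for a in MOVES}

cell = (1, 0)
print(V[cell], {a: Q[(cell, a)] for a in MOVES})
```

For the cell `(1, 0)` the action towards the end has the value of the state itself, the two blocked actions accumulate one additional penalty, and the action that moves away from the end accumulates two.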

From the estimated optimal value functions, we can obtain the optimal policy following Equations 2.4 and 2.6. The optimal policy is represented in Figure 2.6. One can observe that for all states the policy outputs one or more actions that guide the agent towards the end of the maze.

Figure 2.6: The optimal policy of the maze environment, obtained from the state-value function (Figure 2.5a) and action-value function (Figure 2.5b).

Most RL methods define an iterative solution called generalized policy iteration, which consists of two interacting processes, policy evaluation and policy improvement, as represented in Figure 2.7. Policy evaluation is any method that estimates the value function of a given policy. Policy improvement, as the name implies, uses the newly estimated value function to improve the current policy. These two processes alternate until the optimal policy and optimal value function are obtained. From then on the process stabilizes.

In the following sections, three classes of methods are presented that estimate the value functions defined above in different ways: Dynamic Programming, Monte Carlo methods and Temporal Difference Learning.

2.2 Dynamic Programming

Dynamic Programming is characterized by iteratively solving sub-problems, usually computing multiple intermediate solutions simultaneously, which are later combined to converge to the final solution. The intermediate solutions are usually stored in memory (memoized) after being computed, so that a simple lookup retrieves them later, avoiding recomputation. A requirement of dynamic programming methods is that solutions are combined according to a recursive relationship.

Figure 2.7: Generalized Policy Iteration, adapted from (Sutton and Barto 1998).

Reinforcement Learning is usually modeled as a Markov Decision Process (MDP). An MDP is characterized by a set of states $S$, a set of actions $A$, a state transition function $T$ that encodes the dynamics of the system by giving the probability of reaching state $s'$ when action $a$ is applied in state $s$, $T : S \times A \times S \to [0, 1]$, and an immediate reward function $r$ that provides a reward when the agent transitions from state $s$ to state $s'$ by applying action $a$, $r : S \times A \times S \to \mathbb{R}$. A fundamental characteristic of MDPs is that state $s'$ only depends on the previous state $s$ and the chosen action $a$.
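One possible way of holding such a model in memory is sketched below. The `MDP` class, its field layout and the two-state example are assumptions made for illustration. The only property taken from the definition above is that, for every state-action pair, the transition probabilities over successor states must sum to one.

```python
from dataclasses import dataclass


@dataclass
class MDP:
    states: list
    actions: list
    T: dict   # T[(s, a)] -> {s_next: probability}
    r: dict   # r[(s, a, s_next)] -> immediate reward (missing entries mean 0)

    def validate(self):
        """Check that T(s, a, .) is a probability distribution for every (s, a)."""
        for s in self.states:
            for a in self.actions:
                total = sum(self.T[(s, a)].values())
                assert abs(total - 1.0) < 1e-9, f"T({s}, {a}, .) sums to {total}"
        return True


def expected_reward(mdp, s, a):
    """Expected immediate reward of applying action a in state s."""
    return sum(p * mdp.r.get((s, a, s2), 0.0) for s2, p in mdp.T[(s, a)].items())


# Hypothetical two-state example: "go" from s0 reaches s1 with probability 0.8.
toy = MDP(
    states=["s0", "s1"],
    actions=["stay", "go"],
    T={
        ("s0", "stay"): {"s0": 1.0},
        ("s0", "go"): {"s0": 0.2, "s1": 0.8},
        ("s1", "stay"): {"s1": 1.0},
        ("s1", "go"): {"s0": 1.0},
    },
    r={("s0", "go", "s1"): 1.0},
)
print(toy.validate(), expected_reward(toy, "s0", "go"))
```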

Solving the MDP involves determining the policy that accumulates the largest return. The idea behind the application of Dynamic Programming is to compute a new value function from the current value function. In Reinforcement Learning, the concept of methods producing estimates based on existing estimates is called bootstrapping. This concept will play a key role in Chapter 3 (Q-Batch).

Dynamic Programming methods make extensive use of the Bellman equation (Equation 2.10). The Bellman equation is a recursive relationship that expresses the value of a given state, the expected discounted return, as the sum of the immediate reward and the discounted value of the successor states, weighted by the corresponding transition probabilities.

$$\begin{aligned}
V^\pi(s) &= E_\pi\{R_t \mid s_t = s\} \\
&= E_\pi\Big\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s\Big\} \\
&= \sum_{s'} T(s, \pi(s), s') \Big[ r(s, \pi(s), s') + \gamma E_\pi\Big\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \,\Big|\, s_{t+1} = s'\Big\} \Big] \\
&= \sum_{s'} T(s, \pi(s), s') \big[ r(s, \pi(s), s') + \gamma V^\pi(s') \big]
\end{aligned} \tag{2.10}$$

The Bellman equation can be used to evaluate policy $\pi$, completing the policy evaluation step of Generalized Policy Iteration. From Equation 2.7, an action-value function can be evaluated by additionally substituting $a$ for $\pi(s)$ in Equation 2.10. When the optimal value function $V^*$ is found, the value of every state is given by the action that accumulates the largest expected return. This is known as the Bellman optimality criterion and is presented in Equation 2.11.

$$V^*(s) = \max_a \sum_{s'} T(s, a, s') \big[ r(s, a, s') + \gamma V^*(s') \big] \tag{2.11}$$

Policy improvement can be obtained by finding the action that maximizes the Bellman optimality equation:

$$\pi(s) = \arg\max_a \sum_{s'} T(s, a, s') \big[ r(s, a, s') + \gamma V^\pi(s') \big] \tag{2.12}$$

Equation 2.12 extends Equation 2.4, defining a policy in terms of the Bellman optimality equation. This carries the implication that, in order to obtain a policy from a state-value function, the presence of the MDP model is required in order to estimate the future expected return.

Dynamic Programming methods turn Bellman equations into update rules, that is, assignments that compute new value functions from the current ones. The previous examples of policy evaluation and policy improvement compose the Policy Iteration algorithm, presented in Algorithm 1.

Despite its name, Policy Iteration is not to be confused with the Generalized Policy Iteration concept, which covers any process that evaluates a policy alternating with any process that improves a policy until convergence to the optimal policy. In fact, the Dynamic Programming method known as Value Iteration, which also follows the Generalized Policy Iteration scheme, performs policy evaluation directly according to the Bellman optimality equation. We can consider that policy evaluation is truncated to only one backup, with policy improvement following directly afterwards. The pseudo-code is presented in Algorithm 2.

Algorithm 1 Policy Iteration

Input: V(s) and π(s) arbitrarily initialized, ∀s ∈ S

repeat
    repeat
        ∆ ← 0
        for all s ∈ S do
            v ← V(s)
            V(s) ← Σ_{s'} T(s, π(s), s') [r(s, π(s), s') + γV(s')]
            ∆ ← max(∆, |v − V(s)|)
        end for
    until ∆ ≤ θ    ▷ θ is a small positive value
    StablePolicy ← True
    for all s ∈ S do
        b ← π(s)
        π(s) ← arg max_a Σ_{s'} T(s, a, s') [r(s, a, s') + γV(s')]
        if b ≠ π(s) then
            StablePolicy ← False
        end if
    end for
until StablePolicy == True
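A minimal Python sketch of Algorithm 1 is given below for a hypothetical three-state chain (states 0 and 1, terminal state 2, a reward of -1 per move and $\gamma = 0.9$). The names `transitions`, `backup` and `policy_iteration`, as well as the toy MDP itself, are assumptions made for illustration.

```python
GAMMA, THETA = 0.9, 1e-8
STATES, ACTIONS, TERMINAL = [0, 1, 2], ["left", "right"], 2


def transitions(s, a):
    """Model of the toy chain: list of (probability, next_state, reward)."""
    if s == TERMINAL:
        return [(1.0, s, 0.0)]                 # absorbing state, no further reward
    nxt = min(s + 1, TERMINAL) if a == "right" else max(s - 1, 0)
    return [(1.0, nxt, -1.0)]


def backup(V, s, a):
    """Right-hand side of the Bellman equation for action a in state s."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in transitions(s, a))


def policy_iteration():
    V = {s: 0.0 for s in STATES}
    pi = {s: "left" for s in STATES}           # arbitrary (poor) initial policy
    while True:
        # Policy evaluation: sweep until the value function stops changing.
        while True:
            delta = 0.0
            for s in STATES:
                v = V[s]
                V[s] = backup(V, s, pi[s])
                delta = max(delta, abs(v - V[s]))
            if delta <= THETA:
                break
        # Policy improvement: act greedily with respect to the evaluated V.
        stable = True
        for s in STATES:
            old = pi[s]
            pi[s] = max(ACTIONS, key=lambda a: backup(V, s, a))
            if old != pi[s]:
                stable = False
        if stable:
            return V, pi


V, pi = policy_iteration()
print(V, pi)
```

A discount rate below 1 is used in this sketch because, with the initial policy that always moves left, the undiscounted evaluation step would never converge.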

Algorithm 2 Value Iteration

Input: V(s) arbitrarily initialized, ∀s ∈ S

repeat
    ∆ ← 0
    for all s ∈ S do
        v ← V(s)
        V(s) ← max_a Σ_{s'} T(s, a, s') [r(s, a, s') + γV(s')]
        ∆ ← max(∆, |v − V(s)|)
    end for
until ∆ ≤ θ    ▷ θ is a small positive value
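Algorithm 2 can be sketched in Python in the same way. The three-state chain used here (states 0 and 1, terminal state 2, a reward of -1 per move and $\gamma = 0.9$) and all function names are hypothetical. The greedy policy is extracted at the end with Equation 2.12, which is why the model is still required.

```python
GAMMA, THETA = 0.9, 1e-8
STATES, ACTIONS, TERMINAL = [0, 1, 2], ["left", "right"], 2


def transitions(s, a):
    """Model of the toy chain: list of (probability, next_state, reward)."""
    if s == TERMINAL:
        return [(1.0, s, 0.0)]                 # absorbing state, no further reward
    nxt = min(s + 1, TERMINAL) if a == "right" else max(s - 1, 0)
    return [(1.0, nxt, -1.0)]


def backup(V, s, a):
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in transitions(s, a))


def value_iteration():
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            v = V[s]
            # One backup with the Bellman optimality equation (Equation 2.11).
            V[s] = max(backup(V, s, a) for a in ACTIONS)
            delta = max(delta, abs(v - V[s]))
        if delta <= THETA:
            break
    # Greedy policy with respect to the converged value function (Equation 2.12).
    pi = {s: max(ACTIONS, key=lambda a: backup(V, s, a)) for s in STATES}
    return V, pi


V, pi = value_iteration()
print(V, pi)
```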

Since Value Iteration is able to estimate the value of a different policy than the one being followed, the method is said to be off-policy. Conversely, the Policy Iteration algorithm is said to be on-policy.


Figure 2.8 presents the backup diagram of Equation 2.11. This is a common visualization model for RL update rules. It provides an easy way to observe how the successor states are considered. Time flows from the top to the bottom of the diagram. States are represented by white circles, whereas black circles represent actions. The arcs from action nodes to state nodes represent the observed rewards. The maximum operator is represented as a circle slice over the emerging arcs.

Figure 2.8: Bellman Optimality criterion backup diagram, used in Value Iteration, adapted from (Sutton and Barto 1998).

Policy Iteration and Value Iteration are both proven to converge to the optimal policy (Sutton and Barto 1998). However, the methods are computationally expensive, since they perform several sweeps over the entire state set. Additionally, a drawback of dynamic programming methods is that they assume the existence of a complete and accurate model of the MDP, which is hardly ever available in real complex systems. The presented methods also have the particular feature that they do not require interaction with the environment in order to learn a policy. Given the model of the environment, all adaptation is performed offline. Thus, Dynamic Programming methods are not considered learning algorithms but MDP planners.

The next section presents a class of methods that is able to learn by interacting directly with the environment to estimate value functions, the Monte Carlo methods.


2.3 Monte Carlo Methods

Monte Carlo methods are able to estimate value functions through interaction alone, without requiring the presence of a model. The name derives from the fact that policies are evaluated from a random number of interactions with the environment. By interacting with the environment, an agent can observe the sequence of states, actions and rewards, and compute the average of the returns observed in different episodes. An episode, also known as trial, is a sub-division of the interaction with the environment characterized by an initial state and a terminal state. When the terminal state is reached, a new episode begins. Monte Carlo methods represent the most natural way of estimating a value function as a model of the expected return for all states in the environment. Monte Carlo methods are usually applied in episodic tasks, since well defined returns can be obtained from episode-based interactions. The return is simply computed by summing the observed rewards considering the discount factor, such as in Equation 2.2.

Algorithm 3 presents an example of a simple Monte Carlo method, the Every-visit Monte Carlo method. According to this update rule, an agent interacts with the environment following a given policy $\pi$. After an episode terminates, the return of all visited states is computed from the trajectory sampled during the episode. As the name implies, even if a given state is visited more than once in the same episode, a sample of the return is calculated for every visit and used to update the expected value. This is in contrast to the first-visit variant of the same algorithm, which only computes the return for the first visit to each state.

Algorithm 3 Every-visit Monte Carlo method

Input: π ← policy to be evaluated
Input: Returns(s) ← an empty list, ∀s ∈ S

loop
    Generate an episode using π
    for all states s appearing in the episode do
        for each occurrence of s in the episode do
            R ← return following that occurrence of s
            Append R to Returns(s)
        end for
        V(s) ← E{Returns(s)}    ▷ sample average of the stored returns
    end for
end loop
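The update of Algorithm 3 can be sketched in Python as follows. To keep the example deterministic, the episodes are given as pre-recorded lists of (state, reward) pairs, where the reward is the one received after leaving that state. These trajectories and the function name `every_visit_mc` are hypothetical.

```python
from collections import defaultdict


def every_visit_mc(episodes, gamma=1.0):
    """Estimate V(s) as the average return observed after every visit to s."""
    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        # Walk the episode backwards so that G is the return following each step.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns[state].append(G)       # every visit contributes one sample
    return {s: sum(g) / len(g) for s, g in returns.items()}


# Hypothetical episodes sampled with some policy; each move costs -1.
episodes = [
    [("A", -1.0), ("B", -1.0), ("A", -1.0), ("B", -1.0)],   # A and B visited twice
    [("B", -1.0)],
]
V = every_visit_mc(episodes)
print(V)
```

In the first episode state A contributes the returns -4 and -2, one for each visit. A first-visit variant would keep only the return of -4.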

Figure 2.9 presents the backup diagram of the Every-visit Monte Carlo update rule. It conveniently shows the sum of all the observed rewards until the terminal state is reached. Contrary to the backup diagram of the Bellman optimality equation (Figure 2.8), where all the successor states are considered, the Monte Carlo backup diagram shows the entire sequence of states, actions and rewards of a single episode, from the starting state to the terminal state.

Figure 2.9: Monte Carlo backup diagram, adapted from (Sutton and Barto 1998).

Although a model is not needed to estimate the value function, since we are estimating a state-value function we still need a model of the environment to obtain a policy, as shown in Equation 2.12.

Monte Carlo methods present a number of advantages. Unlike dynamic programming methods, which estimate a value function by processing all states, a Monte Carlo update is independent of the number of states, allowing this class of methods to focus on relevant areas of the state space while ignoring other states. On the other hand, following a given policy may cause some states never to be visited. This is especially problematic when estimating an action-value function. We want to estimate the value of all available actions in a given state; however, if the policy is deterministic, in one episode we can only estimate the value of one action per state.

Although Monte Carlo methods learn from interaction samples, the presence of an MDP model can still be very useful. Probability distribution models fully describe an environment, which makes them very difficult to obtain. Sample models, which only approximately describe the environment, are easier to obtain. These models can be used to generate simulated interactions with the environment, which are then used by Monte Carlo methods. This allows Monte Carlo methods to learn from both real and simulated interactions. This concept is the basis for the work developed in Chapter 5 (Integrating Model Learning and Planning in Batch Reinforcement Learning).

Monte Carlo methods are also characterized by not estimating the value of a state based on the value of other states. Therefore Monte Carlo methods do not bootstrap. Additionally, the presented Monte Carlo methods evaluate the value function according to the policy sampling the environment, therefore being on-policy.

In the next section, a class of methods is presented, that combines the advantages of Dynamic Programming and Monte Carlo methods: the Temporal Difference methods. These methods can learn from experience while also bootstrapping, reusing the values estimated in successor states.

2.4 Temporal Difference Methods

The last class of methods presented here to estimate value functions is the Temporal Difference (TD) methods. Temporal Difference methods update the value function at every time step instead of waiting for the end of the episode like Monte Carlo methods. They do so by moving the current value estimate towards the sum of the observed reward and the discounted value of the successor state. The difference between this target and the current value is often called the Temporal-Difference error:

$$V(s_t) \leftarrow V(s_t) + \alpha \big[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \big] \tag{2.13}$$

where $\alpha$ is the step-size (learning rate) parameter. The simplest of the Temporal Difference methods is TD(0), presented in Algorithm 4.

Algorithm 4 TD(0) algorithm, adapted from (Sutton and Barto 1998)

Input: V(s) initialized arbitrarily
Input: π the policy to be evaluated

loop
    Initialize s
    repeat
        a ← π(s)
        take action a; observe reward, r, and successor state s'
        V(s) ← V(s) + α[r + γV(s') − V(s)]
        s ← s'
    until s is terminal
end loop
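A minimal Python sketch of Algorithm 4 follows. It evaluates a fixed policy on a hypothetical deterministic chain (states 0 and 1, terminal state 2, a reward of -1 per move). The helpers `policy` and `env_step` stand in for the policy being evaluated and for the real environment, and are assumptions of this example.

```python
TERMINAL = 2


def policy(s):
    """Hypothetical policy to be evaluated: always move right."""
    return "right"


def env_step(s, a):
    """Hypothetical deterministic chain 0 - 1 - 2; every move costs -1."""
    s_next = min(s + 1, TERMINAL) if a == "right" else max(s - 1, 0)
    return s_next, -1.0


def td0(num_episodes, alpha=0.5, gamma=1.0):
    V = {0: 0.0, 1: 0.0, 2: 0.0}           # the terminal state keeps value 0
    for _ in range(num_episodes):
        s = 0
        while s != TERMINAL:
            a = policy(s)
            s_next, r = env_step(s, a)
            # Move V(s) towards the bootstrapped target r + gamma * V(s').
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V


print(td0(1))      # values after a single episode
print(td0(200))    # values approach the true returns of the policy
```

Unlike the Monte Carlo update, each value is corrected immediately after the transition, using the current estimate of the successor state instead of the complete return.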

Again, when estimating a state value function, methods such as Temporal Difference and Monte Carlo may avoid the need for models. However, to obtain a policy based on a state value function the model of the environment is needed, as shown in Equation 2.12. To learn and obtain policies without the need for models an action-value function must be estimated, as in Equation 2.6.

Q-Learning (Watkins and Dayan 1992, Watkins 1989) is arguably the most famous RL algorithm. Its popularity is explained by the ability of the method to learn the value function using only the experience collected from interactions. Since Q-Learning estimates an action-value function, even obtaining the policy does not require the MDP model. The method is presented in procedural form in Algorithm 5. After each interaction with the environment, having observed the successor state $s'$ and the reward $r$ after applying action $a$ in state $s$, the action-value function is updated by combining Equation 2.13 and Equation 2.7.

The Q-learning backup diagram is presented in Figure 2.10. Since Q-learning updates a Q-function, the top node of the backup diagram is a black circle representing the state-action pair being evaluated. The large white circle represents the successor state $s'$. The bottom black nodes represent the values of the different actions available in state $s'$.

Similarly to Value Iteration, Q-Learning is an off-policy method, since it can estimate the value of a given policy while using a different policy to sample from the environment.


Algorithm 5 The Q-Learning algorithm

Input: Q(s, a) initialized arbitrarily, ∀s ∈ S, ∀a ∈ A(s)
Input: π ← policy used to select actions

loop
    Initialize s
    repeat
        a ← π(s)
        take action a; observe reward, r, and successor state s'
        Q(s, a) ← Q(s, a) + α[r + γ max_b Q(s', b) − Q(s, a)]
        s ← s'
    until s is terminal
end loop
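Algorithm 5 can be sketched in Python as shown below. The environment is a hypothetical deterministic chain (states 0 and 1, terminal state 2, a reward of -1 per move), and the behaviour policy is uniformly random, which is an assumption chosen to show that the greedy policy is learned even though it is never followed. The seed only makes the run repeatable.

```python
import random

GAMMA, ALPHA = 0.9, 0.5
ACTIONS = ["left", "right"]
TERMINAL = 2


def env_step(s, a):
    """Hypothetical deterministic chain 0 - 1 - 2; every move costs -1."""
    s_next = min(s + 1, TERMINAL) if a == "right" else max(s - 1, 0)
    return s_next, -1.0


def q_learning(num_episodes, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(TERMINAL + 1) for a in ACTIONS}
    for _ in range(num_episodes):
        s = 0
        while s != TERMINAL:
            a = rng.choice(ACTIONS)                # behaviour policy: uniformly random
            s_next, r = env_step(s, a)
            # The target uses the best action in s', whatever is executed next.
            best_next = max(Q[(s_next, b)] for b in ACTIONS)
            Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
            s = s_next
    return Q


Q = q_learning(2000)
greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in (0, 1)}
print(greedy)
```

The greedy policy is read directly from the table with Equation 2.6, without using the transition model.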

Figure 2.10: Q-learning backup diagram, adapted from (Sutton and Barto 1998).

2.5 Exploration/Exploitation Dilemma

So far we have assumed that all states and actions can be visited infinitely often, which guarantees that the optimal policy is obtainable. However, such a scenario is very unlikely. For instance, when learning on a real robotic platform, the interaction with the environment is limited by a number of factors, such as the condition of the hardware. If the agent always chooses its actions according to Equation 2.6 or Equation 2.12, it acts greedily, since it is always exploiting its current knowledge of which actions are best. By following greedy policies, an agent can get stuck in local maxima, always following the same (sub-optimal) actions. To determine which actions are indeed the best, the agent must sometimes choose actions that, according to its current knowledge, are sub-optimal. In other words, the agent must explore alternative actions. Since the agent can only apply one action at each decision, RL agents face the dilemma of exploiting the current greedy policy or exploring (apparently sub-optimal) actions that may provide valuable insight into better ways to accomplish the task at hand.


Exploration can be incorporated in an RL agent by extending a greedy policy. The basic exploration strategy is to add noise to the action decision. This is known as the $\epsilon$-greedy policy, which chooses a random action with probability $\epsilon$:

$$\pi(s, a) = \begin{cases} 1 - \epsilon + \dfrac{\epsilon}{|A|} & \text{if } a = \arg\max_b Q(s, b) \\[2mm] \dfrac{\epsilon}{|A|} & \text{otherwise} \end{cases} \tag{2.14}$$

with $|A|$ being the size of the action set $A$.

While this exploration strategy is simple, the value of $\epsilon$ plays a major role. Setting it very low results in an agent that hardly explores. Setting it too high allows the agent to explore, but the agent will ultimately choose sub-optimal actions most of the time. A fixed value of $\epsilon$ is therefore a major drawback. A common solution is to reduce $\epsilon$ over time, restricting exploration as more experience is collected.
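The $\epsilon$-greedy policy of Equation 2.14 can be sketched in Python as follows. The function names and the exponential decay schedule, including its constants, are assumptions made for this example. Other schedules are equally possible.

```python
import random


def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of Equation 2.14 for the Q-values of one state."""
    n = len(q_values)
    greedy = max(range(n), key=lambda i: q_values[i])
    probs = [epsilon / n] * n              # every action can be picked at random
    probs[greedy] += 1.0 - epsilon         # the greedy action gets the remaining mass
    return probs


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Sample an action index: random with probability epsilon, greedy otherwise."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


def decayed_epsilon(t, eps0=1.0, decay=0.99, eps_min=0.05):
    """Hypothetical schedule: shrink epsilon geometrically down to a floor."""
    return max(eps_min, eps0 * decay ** t)


q = [0.1, 0.5, 0.2, 0.0]                   # hypothetical Q(s, .) for four actions
print(epsilon_greedy_probs(q, 0.1))
print(epsilon_greedy_action(q, decayed_epsilon(1000)))
```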

Although popular, mainly due to their simplicity, $\epsilon$-greedy approaches have a particular drawback: when exploring, they choose all actions with the same probability. It is therefore likely that particularly bad actions are chosen for exploration. Ideally, we would want to explore actions according to their estimated returns. Strategies that do so are called softmax exploration strategies. One example relies on a Boltzmann distribution (Kaelbling et al. 1996), as shown in Equation 2.15. As with $\epsilon$-greedy policies, this exploration strategy is defined by a free parameter $\tau$, called the temperature. Low temperatures reduce exploration, while high temperatures produce nearly equiprobable actions. Despite the advantage over $\epsilon$-greedy policies, softmax policies require more domain knowledge when defining the parameter $\tau$.

$$\pi(s, a) = \frac{e^{Q(s, a)/\tau}}{\sum_{i=1}^{|A|} e^{Q(s, a_i)/\tau}} \tag{2.15}$$
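The effect of the temperature in Equation 2.15 can be observed with a short Python sketch. The function name `boltzmann_probs` and the Q-values are hypothetical. Subtracting the largest Q-value before exponentiating is a numerical precaution added here, which leaves the resulting distribution unchanged.

```python
import math


def boltzmann_probs(q_values, tau):
    """Softmax action probabilities of Equation 2.15 with temperature tau > 0."""
    m = max(q_values)                      # shift for numerical stability only
    weights = [math.exp((q - m) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


q = [1.0, 2.0, 0.0]                        # hypothetical Q(s, .) for three actions
print(boltzmann_probs(q, 0.01))            # low temperature: almost greedy
print(boltzmann_probs(q, 1.0))             # probabilities ordered by estimated return
print(boltzmann_probs(q, 1e6))             # high temperature: almost uniform
```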

A different approach involves the optimistic initialization of RL components. In a model-free setting one can initialize an action-value function with high values. As the values of the chosen actions are corrected, other actions are explored (Brafman and Tennenholtz 2002). This results in systematic exploration. In a model-based setting, on the other hand, the MDP model can be optimistically initialized. Usually, an optimistic MDP model sets all transitions to lead to high-valued terminal states. Again, as high-valued actions are chosen the model is incrementally corrected and different actions will be chosen.

The choice of exploration strategy depends largely on the problem at hand. Ideally, however, a good exploration strategy explores when the uncertainty regarding the return is high, and exploits its current knowledge when the uncertainty is low. This concept is commonly known as optimism in face of uncertainty and is represented in Figure 2.11. Gaussian distributions are used to represent the return estimated by a Q-function. While the average return given by action $a_1$ is the lowest of the available actions, it also has the highest associated uncertainty. Picking this action will likely reduce the uncertainty. Therefore, it is a good candidate for being chosen by a policy that is optimistic in face of uncertainty.

Figure 2.11: Optimism in face of uncertainty. The more uncertain an action, the higher the probability to be chosen. The plot shows the probability densities $p(Q)$ of $Q(s, a_1)$, $Q(s, a_2)$ and $Q(s, a_3)$ as a function of $Q$.

The solution relies on estimating an upper confidence bound on the value of an action. Hoeffding’s inequality bounds the probability that the expected value of a random variable $X$ exceeds the sample mean $\bar{X}_N$ of $N$ samples by more than an upper limit $U$:

$$p(E[X] > \bar{X}_N + U) \le e^{-2NU^2} \tag{2.16}$$

This concept can be readily applied to exploration, as in the UCB1 algorithm (Auer et al. 2002). Considering a value function $Q(s, a)$, UCB1 solves Equation 2.16 for $U(s, a)$, assuming that the probability $p$ decreases as more actions are chosen. The algorithm keeps a count $N(s, a)$ of the number of times action $a$ has been chosen in state $s$. UCB1 is presented in Equation 2.17.


\[ \pi(s) = \arg\max_a \left[ Q(s,a) + U(s,a) \right] = \arg\max_a \left[ Q(s,a) + \sqrt{\frac{-\log p}{2N(s,a)}} \right] \tag{2.17} \]

UCB1 plays a key role in the well-known UCT algorithm (Kocsis and Szepesvári 2006), an instance of the Monte Carlo Tree Search algorithms (Browne et al. 2012), which have been gaining relevance for their ability to tackle very large, high-dimensional learning environments, such as the game of Go (Silver et al. 2012).
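The UCB1 rule of Equation 2.17 can be sketched on a single-state problem (a multi-armed bandit). The sketch assumes the schedule p = t^{-4}, with t the total number of actions taken so far, which turns the bonus into the familiar form U = sqrt(2 log t / N). The Bernoulli rewards, the arm means, and the names `ucb1_bonus` and `run_ucb1` are illustrative assumptions and are not taken from the thesis.

```python
import math
import numpy as np

def ucb1_bonus(t, n):
    """Exploration bonus U obtained from Equation 2.17 with p = t**-4 (assumed)."""
    return math.sqrt(2.0 * math.log(t) / n)

def run_ucb1(arm_means, n_steps, seed=0):
    """Run UCB1 on Bernoulli arms; returns value estimates and pull counts."""
    rng = np.random.default_rng(seed)
    k = len(arm_means)
    q = np.zeros(k)             # sample-mean reward of each action
    n = np.zeros(k, dtype=int)  # N(a): times each action was chosen
    for t in range(1, n_steps + 1):
        if t <= k:
            a = t - 1           # choose each action once so that N(a) > 0
        else:
            scores = [q[i] + ucb1_bonus(t, n[i]) for i in range(k)]
            a = int(np.argmax(scores))
        r = float(rng.random() < arm_means[a])  # Bernoulli reward in {0, 1}
        n[a] += 1
        q[a] += (r - q[a]) / n[a]               # incremental sample mean
    return q, n

q_hat, pulls = run_ucb1([0.2, 0.5, 0.8], n_steps=5000)
print("estimates:", np.round(q_hat, 2), "pulls:", pulls)
```

Because the bonus shrinks as N grows and increases only logarithmically with t, rarely chosen actions are revisited from time to time while most choices concentrate on the action with the highest estimated value.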

Alternatively, in a Bayesian setting with a probabilistic model of the Q-function, we can compute an optimistic policy in the face of uncertainty according to Equation 2.18

\[ \pi(s) = \arg\max_a \left[ Q_\mu(s,a) + c\,\frac{Q_\sigma(s,a)}{N(s,a)} \right] \tag{2.18} \]

with Q_μ and Q_σ being the mean and variance of Q, respectively, and c a real value that promotes or inhibits exploration based on the uncertainty of the expected return.

Other heuristics can be used to promote exploration. For instance, the TEXPLORE algorithm makes extensive use of knowledge from models of the environment to shape the reward in order to target exploration (Hester and Stone 2012, 2013, Hester et al. 2013). Despite the large number of exploration strategies available, no thorough comparison has been performed from which meaningful conclusions could be drawn.
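As a small worked illustration of Equation 2.18, the sketch below scores three actions in a single state, mirroring the situation of Figure 2.11 in which one action has the lowest mean but the highest uncertainty. All numeric values and the function name `optimistic_action` are hypothetical, chosen only to show how c and N(s, a) shift the choice.

```python
import numpy as np

def optimistic_action(q_mu, q_sigma, counts, c):
    """Score actions as Q_mu + c * Q_sigma / N (Equation 2.18) and pick the best."""
    q_mu = np.asarray(q_mu, dtype=float)
    q_sigma = np.asarray(q_sigma, dtype=float)
    counts = np.asarray(counts, dtype=float)
    scores = q_mu + c * q_sigma / counts
    return int(np.argmax(scores)), scores

# Hypothetical estimates for one state: action 0 has the lowest mean
# but by far the largest uncertainty and the fewest visits.
q_mu = [0.0, 1.0, 0.5]
q_sigma = [4.0, 0.5, 1.0]
counts = [1, 5, 3]

a_greedy, _ = optimistic_action(q_mu, q_sigma, counts, c=0.0)    # pure exploitation
a_optimistic, _ = optimistic_action(q_mu, q_sigma, counts, c=1.0)

# Once action 0 has been tried often, its bonus shrinks and the choice reverts.
a_later, _ = optimistic_action(q_mu, q_sigma, [20, 5, 3], c=1.0)

print(a_greedy, a_optimistic, a_later)
```

With c = 0 the policy is purely greedy with respect to the mean, while a larger c favours uncertain, rarely visited actions.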

2.6 Function Approximators

Classical approaches represent value functions with lookup tables. The table representation allows an accurate estimate of the true value of every state (or state-action pair). However, it also has disadvantages. First, it makes value-function approaches extremely susceptible to the Curse of Dimensionality: the higher the number of dimensions of the learning task, the higher the memory requirements. Even by modern computing standards, some RL tasks are simply impossible to solve using a table. Second, a table cannot be applied directly to continuous state spaces; the state space must first be discretized. Third, table value functions do not generalize to unvisited states, so all states must be visited. In large domains, such as the Backgammon and Go board games, or in continuous domains, such as a

