
INSTITUTO DE COMPUTAÇÃO

Gabriel Moraes Barros

Reinforcement and Imitation Learning Applied to

Autonomous Aerial Robot Control

Aprendizado por Reforço e por Imitação Aplicado ao

Controle de Robôs Autônomos Aéreos

CAMPINAS

2020


Reinforcement and Imitation Learning Applied to Autonomous

Aerial Robot Control

Aprendizado por Reforço e por Imitação Aplicado ao Controle de

Robôs Autônomos Aéreos

Dissertação apresentada ao Instituto de Computação da Universidade Estadual de Campinas como parte dos requisitos para a obtenção do título de Mestre em Ciência da Computação.

Dissertation presented to the Institute of Computing of the University of Campinas in partial fulfillment of the requirements for the degree of Master in Computer Science.

Supervisor/Orientadora: Profa. Dra. Esther Luna Colombini

Este exemplar corresponde à versão final da Dissertação defendida por Gabriel Moraes Barros e orientada pela Profa. Dra. Esther Luna Colombini.

CAMPINAS

2020


Biblioteca do Instituto de Matemática, Estatística e Computação Científica Ana Regina Machado - CRB 8/5467

Barros, Gabriel Moraes,

B278r Reinforcement and imitation learning applied to autonomous aerial robot control / Gabriel Moraes Barros. – Campinas, SP : [s.n.], 2020.

Orientador: Esther Luna Colombini.

Dissertação (mestrado) – Universidade Estadual de Campinas, Instituto de Computação.

1. Aprendizado por imitação. 2. Robótica. 3. Aprendizado por reforço. I. Colombini, Esther Luna, 1980-. II. Universidade Estadual de Campinas. Instituto de Computação. III. Título.

Informações para Biblioteca Digital

Título em outro idioma: Aprendizado por reforço e por imitação aplicado ao controle de

robôs autônomos aéreos

Palavras-chave em inglês:

Imitation learning
Robotics
Reinforcement learning

Área de concentração: Ciência da Computação
Titulação: Mestre em Ciência da Computação
Banca examinadora:
Esther Luna Colombini [Orientador]
Hélio Pedrini
Reinaldo Augusto da Costa Bianchi

Data de defesa: 22-04-2020

Programa de Pós-Graduação: Ciência da Computação

Identificação e informações acadêmicas do(a) aluno(a)

- ORCID do autor: https://orcid.org/0000-0003-1122-968X
- Currículo Lattes do autor: http://lattes.cnpq.br/9087770974893169


Gabriel Moraes Barros

Reinforcement and Imitation Learning Applied to Autonomous

Aerial Robot Control

Aprendizado por Reforço e por Imitação Aplicado ao Controle de

Robôs Autônomos Aéreos

Banca Examinadora:

• Profa. Dra. Esther Luna Colombini
Instituto de Computação - Universidade Estadual de Campinas

• Reinaldo Augusto da Costa Bianchi
Centro Universitário FEI

• Prof. Dr. Hélio Pedrini
Instituto de Computação - Universidade Estadual de Campinas

A ata da defesa, assinada pelos membros da Comissão Examinadora, consta no SIGA/Sistema de Fluxo de Dissertação/Tese e na Secretaria do Programa da Unidade.

Resumo

Na robótica, o objetivo final do aprendizado por reforço é dotar os robôs com a capacidade de aprender, adaptar-se e reproduzir tarefas dinâmicas, baseadas na exploração e no aprendizado autônomo. O Aprendizado por Reforço (RL) visa solucionar esse problema, permitindo que um robô aprenda comportamentos por tentativa e erro. Com o RL, uma Rede Neural pode ser treinada como um aproximador de função para mapear diretamente estados para comandos do atuador, tornando qualquer estrutura de controle pré-definida desnecessária para o treinamento. No entanto, o conhecimento necessário para convergir esses métodos geralmente é construído a partir do zero; o aprendizado pode levar muito tempo, sem mencionar que os algoritmos de RL precisam de uma função de recompensa explícita e, às vezes, não é trivial definir uma. Muitas vezes, é mais fácil para um professor, seja agente humano ou artificial, demonstrar o comportamento desejado ou como realizar uma determinada tarefa. Os seres humanos e outros animais têm uma capacidade natural de aprender habilidades a partir da observação, geralmente apenas vendo os efeitos dessas habilidades, sem o conhecimento direto das ações subjacentes sendo tomadas. O mesmo princípio existe no Imitation Learning (IL), uma abordagem para sistemas autônomos adquirirem políticas de controle quando uma função explícita de recompensa não estiver disponível, usando a supervisão fornecida como demonstração de um especialista, normalmente um operador humano. Nesse cenário, o objetivo principal deste trabalho é projetar um agente que possa imitar com êxito uma política de controle adquirida anteriormente usando o Imitation Learning. O algoritmo escolhido é o GAIL, pois consideramos que é o algoritmo adequado para resolver esse problema utilizando trajetórias de especialistas (estado, ação). Como trajetórias de especialistas de referência, implementamos métodos de ponta on-policy e off-policy, PPO e SAC. Os resultados mostram que as políticas aprendidas para os três métodos podem resolver a tarefa de controle de baixo nível de um quadrirrotor e que todas foram capazes de generalizar além das tarefas originais, embora com uma performance inferior, conforme previsto.

Abstract

In robotics, the ultimate goal of reinforcement learning is to endow robots with the ability to learn, improve, adapt, and reproduce tasks with dynamically changing constraints based on exploration and autonomous learning. Reinforcement Learning (RL) aims at addressing this problem by enabling a robot to learn behaviors through trial-and-error. With RL, a Neural Network can be trained as a function approximator to directly map states to actuator commands, making any predefined control structure unnecessary for training. However, the knowledge required to converge these methods is usually built from scratch, learning may take a long time, not to mention that RL algorithms need an explicit reward function and sometimes it is not trivial to define one. Often it is easier for a teacher, human or intelligent agent, to demonstrate the desired behavior or how to accomplish a given task. Humans and other animals have a natural ability to learn skills from observation, often from merely seeing the effects of these skills, without direct knowledge of the underlying actions being taken. The same principle exists in Imitation Learning, a practical approach for autonomous systems to acquire control policies when an explicit reward function is unavailable, using supervision provided as demonstrations from an expert, typically a human operator. In this scenario, the primary objective of this work is to design an agent that can successfully imitate a previously acquired control policy using Imitation Learning. The chosen algorithm is GAIL, since we consider it the proper algorithm to tackle this problem by utilizing expert (state, action) trajectories. As reference expert trajectories, we implement the state-of-the-art on-policy and off-policy methods PPO and SAC. Results show that the learned policies for all three methods can solve the task of low-level control of a quadrotor and that all of them generalize beyond the original tasks, although with lower performance, as expected.


2.1 Markov process . . . 19

2.2 Agent-Environment interaction loop. Extracted from [58] . . . 20

2.3 Multimodal Q-functions . . . 25

3.1 Snapshots of the different environments. . . 34

3.2 Block stacking task . . . 35

3.3 Learning system and task environment . . . 36

3.4 Snapshots of demonstrations and imitations . . . 38

3.5 Context translation in the reaching task . . . 39

3.6 Etimology of IfO - Imitating from observation . . . 40

3.7 Actor and Critic Networks . . . 41

3.8 Descending Maneuver . . . 42

3.9 Different simulation scenarios . . . 43

3.10 States of the Inspecting Drone Simulation . . . 43

3.11 Normalized Reward Graphs . . . 44

3.12 Different wing configuration . . . 45

4.1 SAC diagram . . . 48

4.2 PPO diagram . . . 48

4.3 GAIL diagram . . . 49

4.4 Quadrotor body frame . . . 50

4.5 Simulated quadrotor in Coppelia Simulator . . . 52

4.6 PWM x Force relation . . . 53

4.7 Initial values for the yaw angle . . . 57

4.8 Possible initial positions . . . 57

5.1 Average Returns in training PPO with Normal Reward starting at the goal . . . 62
5.2 Distances for PPO with Reward R1 starting at the goal (I1) . . . 62

5.3 Angular Velocities for PPO with Reward R1 starting at the goal (I1) . . . 63

5.4 Distances for PPO with Reward R2 starting at the goal (I1) . . . 64

5.5 Angular Velocities for PPO with Reward R2 starting at the goal (I1) . . . 64

5.6 Distances for SAC with Reward R1 starting at the goal (I1) . . . 65

5.7 Angular Velocities for SAC with Reward R1 starting at the goal (I1) . . . 65

5.8 Distances for SAC with Reward R2 starting at the goal (I1) . . . 66

5.9 Angular Velocities for SAC with Reward R2 starting at the goal (I1) . . . 67

5.10 Distances for PPO with Reward R2 starting at I2 . . . 68

5.11 Angular Velocities for PPO with Reward function R2 and initialization I2 . . . 68
5.12 Distances for SAC with Reward R2 starting at I2 . . . 69
5.13 Angular Velocities for SAC with Reward function R2 and initialization I2 . . . 69
5.14 Average Returns in training PPO with initialization I3 Discretized_Uniform . . . 70


5.18 Angular Velocities for SAC with Reward function R2 and initialization I3 . 72

5.19 PPO π∗ for the Line trajectory . . . 73

5.20 PPO π∗ for the Square trajectory . . . 74

5.21 PPO π∗ for the Sinusoidal trajectory . . . 75

5.22 SAC π∗ for the Line trajectory . . . 76

5.23 SAC π∗ for the Square trajectory . . . 77

5.24 SAC π∗ for the Sinusoidal trajectory . . . 77

5.25 Distances for GAIL trained upon PPO with Reward R2 and initialization I1 . . . 78
5.26 Angular Velocities for GAIL trained upon PPO with Reward R2 and initialization I1 . . . 79

5.27 Distances for GAIL trained upon PPO with Reward R2 and initialization I2 . . . 80
5.28 Angular Velocities for GAIL trained upon PPO with Reward R2 and initialization I2 . . . 80

5.29 GAIL trained upon PPO with the Line trajectory . . . 81

5.30 GAIL trained upon PPO with the Square trajectory . . . 82

5.31 GAIL trained upon PPO with the Sinusoidal trajectory . . . 83

5.32 Drone position in xoz-plane. Sinusoidal Trajectory with GAIL trained upon PPO pursuing a fast moving target . . . 83

5.33 Drone 3D position comparison between algorithms in a Line trajectory . . 84

5.34 Drone 3D position comparison between algorithms in a Square trajectory . . . 84
5.35 Drone 3D position comparison between algorithms in a Sinusoidal trajectory . . . 85
5.36 Recovery from extreme conditions . . . 86


4.1 Parameters used in SAC algorithm. . . 59

4.2 Parameters used in PPO algorithm. . . 59

4.3 Parameters used in GAIL algorithm. . . 60

AI Artificial Intelligence
API Application Programming Interface
BC Behavior Cloning
CNN Convolutional Neural Networks
DDPG Deep Deterministic Policy Gradient
DRL Deep Reinforcement Learning
GAIL Generative Adversarial Imitation Learning
GAN Generative Adversarial Networks
IL Imitation Learning
JS Jensen–Shannon
MDP Markov Decision Processes
MLP Multi-layer Perceptron
NN Neural Networks
PID Proportional-Integrative-Derivative
PPO Proximal Policy Optimization
PWM Pulse Width Modulation
RL Reinforcement Learning
SAC Soft Actor-Critic
TRPO Trust Region Policy Optimization
VAE Variational Auto-Encoders


1 Introduction 13

1.1 Problem Description . . . 15

1.2 Motivation and Challenges . . . 15

1.3 Objectives and Contributions . . . 16

1.4 Research Questions . . . 17

1.5 Text Organization . . . 17

2 Theoretical Background 18
2.1 Reinforcement Learning . . . 18

2.1.1 Markov Decision Processes . . . 18

2.1.2 Agent-Environment Interaction . . . 19

2.1.3 Reinforcement Learning Algorithms . . . 22

2.2 Forward Deep Reinforcement Learning Algorithms . . . 24

2.2.1 SAC - Soft Actor-Critic . . . 24

2.2.2 PPO - Proximal Policy Optimization . . . 26

2.3 Imitation Learning . . . 27

2.3.1 GAIL - Generative Adversarial Imitation Learning . . . 29

3 Related Work 33
3.1 Advances in Imitation Learning . . . 33

3.2 Quadrotors and Reinforcement Learning . . . 40

4 Materials and Methods 47
4.1 Proposed Framework . . . 47

4.2 Drone Dynamics . . . 49

4.3 Coppelia Simulator (former V-REP) . . . 51

4.4 Hardware . . . 52
4.5 Agents/Models/Networks . . . 53
4.5.1 Drone Agent . . . 53
4.5.2 Reward Functions . . . 54
4.5.3 Initializations . . . 55
4.5.4 States . . . 56
4.5.5 Actions . . . 58
4.6 Algorithm Configurations . . . 58
4.6.1 SAC . . . 58


5.1.2 Initialization I2 . . . 66

5.1.3 Initialization I3 Discretized Uniform . . . 67

5.1.4 Control evaluation of best policies . . . 71

5.2 Gail on On-Policy expert trajectories . . . 78

5.2.1 Initialization I1 . . . 78

5.2.2 Initialization I2 . . . 78

5.3 Evaluating Gail’s control performance on different trajectories . . . 79

5.4 Robustness of the trained deterministic policies . . . 85

6 Conclusions and Future Work 87


Chapter 1

Introduction

The presence of robots in society is becoming ever more prevalent. Building versatile embodied agents, both in the form of real robots and animated avatars, capable of a broad and diverse set of behaviors, is one of the long-standing challenges of AI (Artificial Intelligence). However, state-of-the-art robots cannot compete with the effortless variety and adaptive flexibility of motor behaviors produced by toddlers [89]. Indeed, crafting fully autonomous agents that interact effectively with their environments to learn optimal behaviors, and improve over time through trial and error, is still a challenge in the field of intelligent robotics [4].

In recent years, the demand for intelligent agents capable of mimicking human behavior has grown substantially. Indeed, advancements in robotics and communication technology have given rise to many potential applications that need artificial intelligence not only to make intelligent decisions but also to perform motor actions realistically in a variety of situations.

While Robotics has shaped the way the world manufactures its goods nowadays, this was mainly due to Optimal Control, with restrictive tolerances and, predominantly, careful tuning by an expert. Although robotic applications in industry are suitable for high-volume production of a few different items (like automobiles), they cannot change rapidly to newer tasks, especially in the hands of a group of non-roboticists.

Generally speaking, a massive part of the universe of tasks that could be handled by robots is neglected due to a restrictive starting cost. However, the traditional algorithms applied, namely model-based RL (Reinforcement Learning) and Optimal Control algorithms, require a dynamics model, which describes how the system and the environment evolve, and a reward function to find an optimal control policy. Hence, the clear bottleneck lies in the problems related to modeling the dynamics of these systems.

In fact, for many problems, it can be very challenging to [3]:

• Write down a closed form of the control task. What is the objective function of "flying well while doing acrobatics" or "scoring a goal that best entertains the audience"?

• Provide an accurate dynamics model, with all the restrictions of data collection, complex nonlinear terms, maybe granular media, and non-rigid bodies.


• Find a near-optimal control policy, even when all the rest is appropriately set.

If a more autonomous kind of robot is wanted, capable of quickly understanding the context in which it stands and which task (and how to perform it) would be more beneficial to it, then it must have some other way of learning control policies. The reason is that we need dynamic adaptability to perform in a stochastic and complex world such as ours, whose complex tasks combine several minor behaviors and movement patterns. Model-free RL [69] is an area of research that tries to achieve such results by learning from experience and trial-and-error.

RL aims to solve the same problem as optimal control. Still, since the state transition dynamics are not available to the agent, the consequences of its actions have to be learned by the agent itself while interacting with the environment, by trial-and-error. Ultimately, it can help endow robots with the ability to learn, improve, adapt, and reproduce tasks with dynamically changing constraints based on exploration and autonomous learning.

An RL agent interacts with its environment and, upon observing the consequences of its actions, it can learn to alter its behavior in response to rewards received. This paradigm of trial-and-error learning has its roots in behaviorist psychology [81], and the mathematical formalism is based on Optimal Control, especially dynamic programming [7].

Although RL had some successes in the past [83, 43, 2], previous approaches lacked scalability and were inherently limited to fairly low-dimensional problems. These limitations exist because tabular RL algorithms would have to store an immense amount of data if they were to discretize finely enough to obtain a faithful description of a continuous state space. Not to mention that the action-value functions are estimated separately for each state, without any generalization, which means that similar states are stored far apart from each other, as entirely distinct states.

Traditional tabular RL approaches, both Value- and Policy-Based, use tables whose entries receive a new and better estimate of their value at each iteration. However, for continuous state spaces, one has to use a function approximator to obtain a lower-dimensional representation of that function that still presents acceptable results. For a while, linear function approximators were applied but could not account for the more difficult tasks [54], whereas non-linear function approximators were prone to diverge. Recently, approaches that apply, for instance, experience replay [54] to Neural Networks (NN) have allowed the use of these non-linear approximators for RL problems. The field of Deep Reinforcement Learning (DRL) deals with this approach.

Although recent advances in DRL have shown tremendous and promising results, one of the most significant caveats of this solution is that the nature of learning by gradient descent requires considerable training time and a lot of interactions (the equivalent of samples in supervised learning). If we accept that the main driver for utilizing model-free algorithms is that we need solutions with a shorter deployment time and a lower influence of experts, it is seriously problematic to need large datasets for each task that the robot will learn how to perform. One way to mitigate these drawbacks is learning by demonstration, where a robot agent learns the optimal policy by observing an expert agent (that is expected to have a near-optimal policy). We will refer to this kind of learning as Imitation Learning (IL).


Learning behaviors by IL, especially considering that we can learn from humans and/or other robots, is one of the most promising new fields in robotics. This approach can significantly decrease the training time and, hopefully, in the future, lead to robots that can learn daily tasks just by watching them being performed on the Internet. Although nowadays robots apply a feature extraction process to guide their policy search while learning from the expert agents' demonstrations, in the future, an end-to-end approach based on vision is expected to prevail.

In this work, we aim at studying how Imitation Learning can help Reinforcement Learning to address the problem of learning to perform tasks in the robotics context. By doing this, we aim at contributing to the state-of-the-art in the search for intelligent agents that can quickly grasp new behaviors and infer what they could (and what they need to) learn by absorbing other agents’ knowledge.

1.1

Problem Description

The problem that we want to address in this work is the intelligent control of a robotic agent, a UAV, to perform a go-to-target task. We want to design a robust control policy that can achieve the goal from a myriad of different starting poses.

The actuation in our controller is performed by a Neural Network, trained with Reinforcement Learning (RL), which outputs the mean and standard deviation of four Gaussian distributions, one for each thruster. We sample random actions in the exploration phase, and in the evaluation phase, we use the mean of the distribution at the current timestep. The perception is made with ground truth information that we can access in the simulator.

The first part of this work consists of applying the traditional RL approach (forward RL) to find our π∗s. In RL theory, π∗ denotes the optimal policy for a given task, but for our purposes it denotes the quasi-optimal policies found over the different scenarios. Acquiring these policies is extremely hard and time-consuming, as much of the effort is spent on testing hyperparameters, MDP configurations, and reward shaping.

The second part consists of applying Imitation Learning (IL) to train a control policy based on expert trajectories, to evaluate whether we can train an agent without the laborious chore of reward shaping.

All of our experiments are made in simulation, but they were designed in such a way that the trained policy can be easily transferred to a real robot. More details will be given later.

1.2

Motivation and Challenges

One of the biggest motivations for this work is that we need more adaptability to face a fast-paced and dynamic world. We cannot continue to depend on heavily constrained robotic applications, with pre-defined responses to pre-defined tasks, which is how current automation works.


Another point is that we would like to be able to re-train agents with ease, obtaining satisfactory responses to novel tasks and, in the limit, enabling non-specialists to show the agent what needs to be done.

However, there is a great challenge to overcome, since RL, and even DRL, is generally used with i) medium-to-high-level decisions and ii) simple dynamic models. The field has yet to advance to obtain robust and safe DRL controllers for real mobile robots or for robots simulated in more realistic simulators.

1.3

Objectives and Contributions

The primary objective of this work is to design an agent that can successfully imitate a prior acquired control policy using IL (Imitation Learning). The chosen algorithm is Generative Adversarial Imitation Learning (GAIL) since we consider that it is the proper algorithm to tackle this problem by utilizing expert (state, action) trajectories.

Although the research focus is related to IL, having a trained optimal policy π∗ that works as a reference for the imitated behavior is mandatory. To fulfill this necessity, we chose two state-of-the-art algorithms, one on-policy, PPO (Proximal Policy Optimization) [36], and one off-policy, SAC (Soft Actor-Critic) [21], [22], to perform the tasks, and then used the rollouts of the expert trajectories as input to the imitation process.

In this context, the specific objectives of this work are:

1. O1: To apply a model-free on-policy algorithm (PPO) to learn an optimal policy π∗ that can perform low-level control of a UAV.

2. O2: To apply a model-free off-policy algorithm (SAC) to learn an optimal policy π∗ that can perform low-level control of a UAV.

3. O3: To evaluate the impact of distinct choices of MDPs (Markov Decision Processes), hyperparameters, and reward functions on the quality of the policies learned in O1 and O2;

4. O4: To use GAIL to imitate the learned policies (O1), aiming at acquiring an optimal policy through Imitation Learning;

5. O5: To evaluate the impact of distinct choices of MDPs (Markov Decision Processes), hyperparameters, and reward functions on the resulting policy of GAIL.

As contributions of this work, we can cite: i) an open-source version of GAIL applied to robotic agents in more complex scenarios than those available in the literature; ii) the first use of SAC to learn a low-level control policy for quadrotors; iii) a framework to use the Drone agent developed by [51, 52] with the plugin PyRep [35], a Python wrapper for the API (Application Programming Interface) of the new Coppelia Simulator [66].


1.4

Research Questions

As our research questions, we aim to answer:

• Q1: Can DRL achieve a suitable controller with a robotic UAV agent?

• Q2: Can Imitation Learning learn from a robust previously trained policy, achieving similar results?

• Q3: Can controllers in a specific task generalize well to similar objectives? In our case, we will train the controller with a fixed target and then evaluate it with a moving one.

• Q4: Can we replicate the study made in [51], i.e., achieving a good optimal control with PPO applied to a UAV go-to-target application?

1.5

Text Organization

This text is organized as follows. Chapter 1 presents the introduction of this text and how it is sectioned, whereas Chapter 2 presents a more in-depth discussion about RL, IL, and the background theory supporting them. Chapter 3 discusses related work in the field of IL, approaching the state-of-the-art models, their drawbacks and advantages, as well as practices regarding the use of DRL in quadrotor control. Chapter 4 describes the materials and methods that we use in our research and, finally, Chapter 5 presents the results obtained. Chapter 6 closes the text with the final remarks, conclusions, and indications of future work.


Chapter 2

Theoretical Background

In this chapter, we present the theoretical background related to Reinforcement Learning and the algorithms SAC, PPO, and GAIL.

2.1

Reinforcement Learning

Reinforcement learning (RL) [69] is a paradigm of machine learning used to understand and automate goal-directed learning and decision making. It defines the interaction between a learner (agent) and the environment in terms of states, actions, and rewards [68]. The idea behind reinforcement learning has its origins in behavioral psychology, and it was extended to computer science and other domains in the sense that a software agent improves its behavior to maximize a given reward signal. Formally, this process of learning through interaction with the environment is described by Markov Decision Processes, defined next (Section 2.1.1).

2.1.1

Markov Decision Processes

The RL task has better guarantees if it can be defined as a Markov Decision Process (MDP). An MDP is an extension of a Markov Process that, in addition to the decisions, satisfies the Markov property, which says that the next state of the process depends only on the current state. That is, given a sequence of states S, actions A and rewards R at each timestep, we have:

p(s_{t+1}, r | s, a) = Pr{S_{t+1} = s_{t+1}, R_{t+1} = r | S_t = s, A_t = a}
                     = Pr{S_{t+1} = s_{t+1}, R_{t+1} = r | S_0, A_0, R_1, . . . , S_{t−1}, A_{t−1}, R_t, S_t, A_t}    (2.1)

for all r, s_{t+1}, s and a, where s is the state, a the action, s_{t+1} the next state, r the reward, and p or Pr the associated probabilities.

Mostly, these processes are memoryless, and every state s must carry enough information to represent the environment at a given instant t accurately. Figure 2.1 illustrates a Markov process.

MDPs are defined as a set of states, actions, and transition probabilities, M(S, A, P). In this work, we will use the augmented notation M(S, A, P, ρ0, γ) of an MDP, which is commonly used in episodic reinforcement learning. Although not strictly necessary, there are better guarantees about the solution of RL tasks if the task can be described as an MDP.

Figure 2.1: Illustration of a Markov process and the transition probability at each pair of states [46].

The components of an MDP in regards to the environment are:

• S is the state space or the finite set of states in the environment. It defines the available and current state of the environment, which can be either discrete or continuous. It is a complete description of the state of the world;

• O is a partial description of the state which may omit information;

• A is the action space, the finite set of actions that an agent can execute. Executing an action will possibly give a reward and change the current state of the environment;

• P(s_{t+1}, r_t | s_t, a_t) is the transition operator. It specifies the probability that the environment will emit reward r_t and transition to state s_{t+1} for each state s_t and action a_t;

• r_t is the reward signal at a given instant t, with r_t ∈ R;

• ρ0 is the initial state probability distribution;

• γ ∈ [0, 1] is the discount rate used to adjust the ratio between the contribution of recent rewards and past rewards.

In most RL problems, the dynamics of the environment, p(s_{t+1}, r | s, a), is not known, even if the environment follows the MDP formulation. However, if the transitions are known, there are algorithms that can find the optimal policy for the agent with no exploration, like the Policy Iteration and Value Iteration algorithms.
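To make the known-transitions case concrete, the sketch below shows a minimal tabular Value Iteration loop for a small discrete MDP. It is an illustrative sketch only, not part of the thesis setup; the transition tensor P, reward matrix R, and discount gamma are hypothetical placeholders.

import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-6):
    """Tabular Value Iteration for a fully known MDP.
    P: transition probabilities, shape (S, A, S).
    R: expected immediate reward, shape (S, A).
    Returns the optimal value function and a greedy policy."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
        Q = R + gamma * (P @ V)        # shape (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    policy = Q.argmax(axis=1)          # greedy policy w.r.t. the converged values
    return V, policy

No exploration is needed here precisely because P and R are given; the agent never has to interact with the environment.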

2.1.2

Agent-Environment Interaction

The two main components of a reinforcement learning system are the agent and the environment. The agent is the component that learns and takes decisions (perceives and acts), and also receives a reward signal from the environment, a number that tells it how good or bad the current world state is. When the agent observes the complete state of the environment, we say that the environment is fully observed. The goal of the agent is to maximize its cumulative reward, called return. The environment is everything outside the agent that is subject to this interaction. It is the world that the agent lives in and interacts with. At every step of interaction, the agent sees an observation of the state of the world and then decides on an action to take. The environment changes when the agent acts on it, but it may also change on its own [58, 68]. Reinforcement learning methods are ways that the agent can learn behaviors to achieve its goal by interacting with this environment while accumulating rewards. Figure 2.2 depicts the Agent-Environment relationship.

Figure 2.2: Agent-Environment interaction loop. Extracted from [58]

Before we accurately define what the return is, it is appropriate to define trajectories. A trajectory τ (the symbol τ is sometimes also used for the Bellman backup operator) is a sequence of states and actions in the world (environment), given by:

τ = (s0, a0, s1, a1, . . . , sn, an) (2.2)

The very first state of the world, s_0, is randomly sampled from the start-state distribution, sometimes denoted by ρ_0 [58]:

s_0 ∼ ρ_0(·)    (2.3)

State transitions (what happens to the world between the state at time t, st, and the

state at t + 1, st+1), are governed by the natural laws of the environment, and depend only

on the most recent action, at. They can be either deterministic or stochastic. Trajectories

are also frequently called episodes or rollouts.

The episodic reinforcement learning problem is an adequate approach for tasks that can be discretized in a series of steps. An episode starts with an initial state s0, sampled

from the initial distribution ρ0. Then, at each timestep t = 0, 1, 2, · · · , the agent samples

an action from the current policy π(a|s), reaching a new state st+1 and a reward signal

r(st), according to the distribution P(st+1, rt|st, at). An episode ends when a terminal

state is reached [72]. In practice, it is common to define a maximum number of steps that an episode can last, or even limit the environment exploration in some other manner so that the state space is reduced and learning is done faster.

The rewards or returns define the finite-horizon undiscounted return, which is the sum of the rewards obtained in a fixed window of steps, given by:

R(τ) = \sum_{t=0}^{T} r_t    (2.4)

However, in many cases, the notion of a final state is not naturally given, or the agent-environment interaction is not easily breakable into episodes, such as in the task of controlling dynamic systems. Therefore, in continuing tasks, the final state is reached when T = ∞. Hence, the infinite-horizon discounted return, which is the sum of all rewards ever obtained by the agent, but discounted by how far off in the future they are obtained, includes a discount factor γ ∈ (0, 1) in its formulation:

R(τ) = \sum_{t=0}^{∞} γ^t r_t    (2.5)

The discount factor, γ, with 0 ≤ γ ≤ 1, allows adjusting the importance of current and future rewards. From Equation 2.5, it is possible to notice that when the value of the factor γ is close to one, the return is more affected by future rewards, and vice versa.
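As a concrete illustration of Equations 2.4 and 2.5, the short helpers below compute the undiscounted and the discounted return of a list of rewards. This is a minimal sketch with made-up numbers, not code from the thesis.

def undiscounted_return(rewards):
    # Equation 2.4: plain sum over a fixed window of steps.
    return sum(rewards)

def discounted_return(rewards, gamma=0.99):
    # Equation 2.5: each reward is weighted by gamma**t.
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# With gamma close to one, a late reward still contributes almost fully;
# with a small gamma, the return is dominated by early rewards.
rewards = [0.0, 0.0, 1.0]
print(discounted_return(rewards, gamma=0.99))  # 0.9801
print(discounted_return(rewards, gamma=0.10))  # 0.01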

A policy is a rule used by an agent to decide what actions to take. It can be deterministic, in which case it is usually denoted by µ:

a_t = µ(s_t),    (2.6)

or it may be stochastic, in which case it is usually denoted by π, with a_t ∼ π(·|s_t).

In Deep Reinforcement Learning (as in this work), we deal with parameterized policies, i.e., policies whose outputs are computable functions that depend on a set of parameters (e.g., the weights and biases of a neural network) which we can adjust to change the behavior via some optimization algorithm.

The parameters of such a policy are denoted by θ (in some references, φ), written as a subscript on the policy symbol to highlight the connection:

a_t = µ_θ(s_t)
a_t ∼ π_θ(·|s_t)

The goal in RL is to select a policy that maximizes the expected return when the agent acts according to it. Consider both the environment transitions and the policy to be stochastic. In this case, the probability of a T-step trajectory is:

P(τ|π) = ρ_0(s_0) \prod_{t=0}^{T−1} P(s_{t+1}|s_t, a_t) π(a_t|s_t).    (2.7)

The expected return (for whichever return measure), denoted by J(π) (we use gradient ascent to change the neural network's weights to maximize the expected return), is then:

J(π) = \int_τ P(τ|π) R(τ) = E_{τ∼π}[R(τ)].    (2.8)

Therefore, in RL we want to optimize π:

π∗ = \arg\max_π J(π),    (2.9)
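Since Equation 2.8 is an expectation over trajectories, it is usually estimated by Monte Carlo: roll the policy out several times and average the (discounted) returns. The sketch below assumes a hypothetical Gym-style environment with reset()/step() and a policy(state) callable; it is only illustrative.

def estimate_expected_return(env, policy, episodes=100, gamma=0.99):
    """Monte Carlo estimate of J(pi) = E_{tau ~ pi}[R(tau)]."""
    returns = []
    for _ in range(episodes):
        state = env.reset()
        done, t, ret = False, 0, 0.0
        while not done:
            action = policy(state)                 # a_t ~ pi(.|s_t)
            state, reward, done, _ = env.step(action)
            ret += (gamma ** t) * reward           # accumulate the discounted reward
            t += 1
        returns.append(ret)
    return sum(returns) / len(returns)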

These definitions of policies and expected discounted returns are the basis for understanding the concept of value functions, used in most reinforcement learning algorithms. Value functions are functions of the state s, which can be interpreted as a measure of how good it is for the agent to be at a given state. While each reward signal r_t is a metric of the immediate reward, the value function, defined in Equation 2.10, also takes into account the possible rewards in future states. By value, we mean the expected return if you start in that state or state-action pair and then act according to a particular policy forever after.

V^π(s) = E_{τ∼π}[ R(τ) | s_0 = s ]    (2.10)

and its optimum value:

V∗(s) = \max_π E_{τ∼π}[ R(τ) | s_0 = s ]    (2.11)

In many cases, we can define it as a function of state-action pairs (s, a), expressing how good it is to perform a given action in a given state, as described in Equation 2.12 of the Q-function (action-value function):

Q^π(s, a) = E_{τ∼π}[ R(τ) | s_0 = s, a_0 = a ]    (2.12)

and its optimum value:

Q∗(s, a) = \max_π E_{τ∼π}[ R(τ) | s_0 = s, a_0 = a ]    (2.13)

We point the reader to [68] for a more formal definition of these terms and relations, and to [78] for a well-illustrated explanation of the Bellman Equation with the elements that we have just defined.

2.1.3

Reinforcement Learning Algorithms

There are many categories of RL solutions that can be more or less adequate for different domains, but several Reinforcement Learning algorithms available in the literature follow the approaches that we will present next.

Policy Optimization Algorithms

If we consider the case of a stochastic, parameterized policy π_θ, we aim to maximize the expected return J(π_θ) = E_{τ∼π_θ}[R(τ)]. For the purposes of this derivation, take R(τ) to be the finite-horizon undiscounted return; the derivation for the infinite-horizon discounted return setting is almost identical.

The goal is to optimize the policy by gradient ascent, updating the parameters as θ_{k+1} = θ_k + α ∇_θ J(π_θ)|_{θ_k}.

The gradient of policy performance, ∇_θ J(π_θ), is called the policy gradient, and algorithms that optimize the policy this way are called policy gradient algorithms.

To use this approach in high-dimensional continuous environments, we need an expression for the policy gradient that we can compute numerically. We then have to derive an analytical gradient of policy performance, leaving it in the form of an expectation, and, after that, make a sample estimate of this expected value, which can be computed with agent-environment interaction.

We point the reader to [59] for the derivation that goes from the probability of a trajectory (Equation 2.7) to the "grad-log-prob" expectation form of the policy gradient.

We can optimize the policy through gradient-based optimization or gradient-free methods [38]. Generally, a variation of Stochastic Gradient Ascent is used to optimize an objective function L of the form:

L^{PG}(θ) = Ê_t [ \log π_θ(a_t|s_t) \sum_t r(s_t, a_t) ]    (2.15)

Vanilla policy gradients are susceptible to high variance when the objective function considers simply the reward-to-go, given by the cumulative reward that we have previously seen. An alternative to reduce the variance of policy gradients, without introducing bias to the model, is to use an alternative objective function with a baseline b, as presented in Equation 2.16:

L^{PG}(θ) = Ê_t [ \log π_θ(a_t|s_t) ( \sum_t r(s_t, a_t) − b ) ]    (2.16)

Subtracting a baseline is allowed since it is an operation that is unbiased in expectation. Any function b used in this way is called a baseline.
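In practice, Equations 2.15 and 2.16 are implemented as a surrogate loss over a batch of sampled data, as in the hedged PyTorch-style sketch below. The tensor names are ours, and the baseline is passed in as a precomputed tensor; this is not the thesis implementation.

import torch

def policy_gradient_loss(log_probs, returns, baseline=None):
    """Surrogate loss whose gradient is the (baselined) policy gradient.
    log_probs: log pi_theta(a_t|s_t) for the sampled actions, shape (N,).
    returns:   reward-to-go (or total return) for each sample, shape (N,).
    baseline:  optional tensor b(s_t), e.g. a value estimate, shape (N,)."""
    weights = returns if baseline is None else returns - baseline
    # Detach the weights: gradients should flow only through log pi_theta.
    return -(log_probs * weights.detach()).mean()

Minimizing this loss with a standard optimizer corresponds to ascending the policy gradient of Equation 2.16.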

The most common choice of baseline is the on-policy value function V^π(s_t). Recall that this is the average return an agent gets if it starts in state s_t and then acts according to policy π for the rest of its life. Empirically, the choice b(s_t) = V^π(s_t) has the desirable effect of reducing variance in the sample estimate for the policy gradient. This results in faster and more stable policy learning. It is also appealing from a conceptual angle: it encodes the intuition that if an agent gets what it expected, it should feel neutral about it.

An actor-critic algorithm consists of a policy gradient method that works in association with a value estimator V̂(s). The actor is the policy that infers the best actions to take, while the critic is the component that bootstraps the evaluation of the current policy. This structure is commonly modeled as two artificial neural networks, one for acting and the other for estimating V̂(s), but different architectures are also viable.

In practice, V^π(s_t) cannot be computed precisely, so it has to be approximated. This is usually done with a neural network, V_φ(s_t), which is updated concurrently with the policy (so that the value network always approximates the value function of the most recent policy). The simplest method for learning V_φ, used in most implementations of policy optimization algorithms, is to minimize a mean-squared-error objective:

φ_k = \arg\min_φ E_{s_t, R̂_t ∼ π_k} [ (V_φ(s_t) − R̂_t)^2 ],    (2.17)

where π_k is the policy at epoch k. This is done with gradient descent.

We can weight the log-probability of the policy with different kinds of rewards, but we can also weight it with the Q-function Q_θ(s_t, a_t). The baseline is then set to b = V̂(s_t), which brings up the concept of advantage A^π, given by Equation 2.18:

A^π(s_t, a_t) = Q^π(s_t, a_t) − V^π(s_t)    (2.18)

The better the estimate of V̂(s), the lower the variance, and the overall learning is more stable than when using vanilla policy gradient methods. We point the reader to the work of [73], which presents a method for approximating the advantage function in policy optimization algorithms and also dwells on the different forms of weighting the policy.
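The sketch below ties Equations 2.17 and 2.18 together: the value network is regressed onto empirical returns and the advantage is formed as return minus baseline. The network shape, dimensions, and tensor names are illustrative placeholders, not the networks used in this work.

import torch
import torch.nn as nn

value_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))  # V_phi(s), toy sizes
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def fit_value_and_advantage(states, returns):
    """states: (N, 4) tensor; returns: (N,) tensor of empirical returns R_hat_t."""
    values = value_net(states).squeeze(-1)
    value_loss = ((values - returns) ** 2).mean()   # Equation 2.17: MSE regression
    optimizer.zero_grad()
    value_loss.backward()
    optimizer.step()
    # Equation 2.18, with Q approximated by the empirical return:
    advantages = returns - values.detach()
    return advantages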

Value function based Algorithms

There are algorithms that, instead of improving the policy directly, estimate the value (Equations 2.10, 2.12) of being in a given state. These RL methods aim to learn V(s) or Q(s, a) and, once an estimate of how good it is to be at each state is known, recover the policy as

π∗(s) = \arg\max_a Q(s, a).

In the same way, the optimal policy π∗ has a corresponding value function V∗(s) = \max_π V^π(s). If the optimal value V∗(s) is known, we can retrieve the optimal policy by choosing among all the available actions, and vice versa [38].

In this RL category, there are important methods, such as SARSA [80] and Q-learning [90].
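For reference, a single tabular Q-learning update has the familiar temporal-difference form below; this is a generic textbook sketch, not part of the thesis experiments.

import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Q-learning step on a table Q of shape (n_states, n_actions)."""
    td_target = r + gamma * np.max(Q[s_next])      # bootstrap with the greedy next value
    Q[s, a] += alpha * (td_target - Q[s, a])       # move the estimate toward the target
    return Q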

2.2

Forward Deep Reinforcement Learning Algorithms

In this section, we introduce the DRL algorithms that are applied in this work.

2.2.1

SAC - Soft Actor-Critic

The Soft Actor-Critic (SAC) algorithm [20, 87, 23] is based upon both actor-critic methods and maximum entropy reinforcement learning. Standard reinforcement learning aims to find the policy which maximizes the expected sum of rewards along trajectories in the environment. Maximum entropy reinforcement learning augments this objective with an entropy bonus to be maximized along the trajectory, encouraging exploration.

In standard RL, we want to find the policy parametrization that gives the maximum cumulative reward \sum_{t=0}^{T} E_{(s_t,a_t)∼ρ_π}[r(s_t, a_t)] over an episode of length T.

The Q-function Q(s, a) is the expected cumulative reward after taking action a at state s. When the agent is in the initial state, the Q-function may look like the one depicted in Figure 2.3 (grey curves), with two distinct modes corresponding to the two high-level decisions the agent can take. A conventional RL approach is to specify a unimodal policy distribution, centered at the maximal Q-value and extending to the neighboring actions to provide noise for exploration (red distribution). Since exploration is biased towards one decision, the agent keeps refining its policy for that decision and ignores the other ones completely. Since the world is dynamic, a small change in it would be enough to make this agent fail. This could be overcome by an agent that follows the green curve (Figure 2.3) when updating its probability distribution [82].

Figure 2.3: Multimodal Q-functions. Extracted from: [82]

At a high level, a solution to this problem is to ensure the agent explores all promising states while prioritizing the more promising ones. One way to formalize this idea is to define the policy directly in terms of exponentiated Q-values (represented by the green distribution in Figure 2.3), as in:

π(a_t|s_t) ∝ \exp Q(s_t, a_t)    (2.19)

This density has the form of the Boltzmann distribution, where the Q-function serves as the negative energy, which assigns a non-zero likelihood to all actions. As a consequence, the agent becomes aware of all behaviors that lead to solving the task, which can help it adapt to changing situations in which some of the solutions might have become unfeasible [82].
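For a discrete action set, Equation 2.19 amounts to a softmax over the Q-values. The snippet below is a small illustrative sketch with a temperature parameter; the numbers are made up and it is not taken from the thesis.

import numpy as np

def boltzmann_policy(q_values, alpha=1.0):
    """pi(a|s) proportional to exp(Q(s,a)/alpha); every action keeps non-zero probability."""
    logits = np.asarray(q_values, dtype=float) / alpha
    logits -= logits.max()                 # numerical stability before exponentiating
    probs = np.exp(logits)
    return probs / probs.sum()

# Two near-equally good actions keep comparable probability mass:
print(boltzmann_policy([1.0, 0.9, -2.0]))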

A more general maximum entropy objective favors stochastic policies by augmenting the reward with the expected entropy of the policy over ρ_π(s_t), encouraging exploration:

J(π) = \sum_{t=0}^{T} E_{(s_t,a_t)∼ρ_π}[ r(s_t, a_t) + α H(π(·|s_t)) ]    (2.20)

where the entropy H is given by:

H(P) = E_{x∼P}[ −\log P(x) ]    (2.21)

The temperature parameter α (Equation 2.20) determines the relative importance of the entropy term against the reward, and thus controls the stochasticity of the optimal policy. This objective has some conceptual and practical advantages. First, the policy is incentivized to explore more widely, while giving up on clearly unpromising avenues. Second, the policy can capture multiple modes of near-optimal behavior. In problem settings where multiple actions seem equally attractive, the policy assigns equal probability mass to those actions.

The SAC version that we use [20] incorporates three key ingredients: an actor-critic architecture with separate policy and value function networks, an off-policy formulation that enables the reuse of previously collected data for efficiency, and entropy maximization to enable stability and exploration. This method combines off-policy actor-critic training with a stochastic actor and further aims to maximize the entropy of this actor with an entropy maximization objective, resulting in a considerably more stable and scalable algorithm that, in practice, exceeds both the efficiency and final performance of Deep Deterministic Policy Gradient (DDPG) [87].

In the end, this algorithm aims to lift the brittleness problem of DRL algorithms applied to continuous high-dimensional environments while maintaining the sample efficiency of off-policy algorithms. A more organized description of SAC can be found in Algorithm 1.

Algorithm 1: SAC - Soft Actor-Critic

1 Initialize parameter vectors (networks) ψ, ψ̄, θ, φ.
2 for each epoch do
3     for each environment step do
4         a_t ∼ π_φ(a_t|s_t)
5         s_{t+1} ∼ p(s_{t+1}|s_t, a_t)
6         D ← D ∪ {(s_t, a_t, r(s_t, a_t), s_{t+1})}
7     end
8     for each gradient step do
9         ψ ← ψ − λ_V ∇_ψ J_V(ψ)
10        θ_i ← θ_i − λ_Q ∇_{θ_i} J_Q(θ_i) for i ∈ {1, 2}
11        φ ← φ − λ_π ∇_φ J_π(φ)
12        ψ̄ ← τ ψ + (1 − τ) ψ̄
13    end
14 end

SAC has excellent convergence properties compared to its predecessors, needing fewer samples to reach good policies and finding policies with a higher reward.
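The target-network update at the end of each gradient step in Algorithm 1 (ψ̄ ← τψ + (1 − τ)ψ̄) is commonly implemented as a Polyak average over parameters, as in the PyTorch-style sketch below; the network objects and the value of tau are illustrative assumptions, not the thesis configuration.

import torch

@torch.no_grad()
def polyak_update(value_net, target_value_net, tau=0.005):
    """Soft update of the target value network: psi_bar <- tau * psi + (1 - tau) * psi_bar."""
    for p, p_targ in zip(value_net.parameters(), target_value_net.parameters()):
        p_targ.mul_(1.0 - tau).add_(tau * p)

Keeping the target network a slowly moving copy of the value network is what stabilizes the bootstrapped value targets.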

2.2.2

PPO - Proximal Policy Optimization

The Proximal Policy Optimization (PPO) [74] method is an on-policy algorithm that, instead of directly optimizing the vanilla policy gradient objective, aims to maximize the following surrogate objective function L^{CPI} (conservative policy iteration), proposed in [39], subject to a constraint on the size of the policy update:

L^{CPI}(θ) = Ê_t [ (π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t)) Â_t ]    (2.22)

Although there are distinct PPO implementations that also use the KL (Kullback-Leibler) divergence constraint Ê_t[ KL[π_{θ_old}(·|s_t), π_θ(·|s_t)] ] ≤ δ on the surrogate objective, the official paper uses only a clipping of the gradient step. This solves an unconstrained optimization problem instead of limiting it with the constraint previously described. PPO is the most widely used model-free on-policy algorithm at the time of this writing.

Considering the probability ratio r(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t), the new objective function is given by

L^{CLIP}(θ) = Ê_t [ \min( r(θ) Â_t, \text{clip}(r(θ), 1 − ε, 1 + ε) Â_t ) ]    (2.23)

where ε is a hyperparameter (we have used ε = 0.2). In [36], the authors explain that it is possible to effectively adjust the size of the policy update by taking the minimum of the clipped and the unclipped objective L^{CPI}.
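A minimal PyTorch-style sketch of the clipped surrogate of Equation 2.23 is shown below; the tensor names are ours, and ε = 0.2 follows the value quoted above. It is illustrative, not the implementation used in this work.

import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """L^CLIP: take the minimum of the unclipped and clipped ratio terms."""
    ratio = torch.exp(new_log_probs - old_log_probs)        # r(theta) = pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()            # negate because we minimize a loss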

Our PPO implementation is slightly different from the ones in [74] and [24], and is described in the pseudo-code presented in Algorithm 2.

Algorithm 2: PPO - Actor-Critic Style

1 for i ∈ 1, . . . , N do
2     Run policy π_old in the environment for T timesteps, collecting {s_t, a_t, r_t}
3     Compute advantage estimates Â_t = \sum_{t'>t} r_{t'} − V_φ(s_t)
4     for j ∈ 1, . . . , opt_epochs do
5         for k ∈ 1, . . . , batch_size/minibatch_size do
6             Optimize surrogate objective L^{CLIP} w.r.t. θ
7             Optimize value function (1/size) \sum_{t=1}^{T} ( \sum_{t'>t} γ^{t'−1} r_{t'} − V_φ(s_t) )^2 w.r.t. φ
8             θ_old ← θ
9         end
10    end
11 end

2.3

Imitation Learning

Imitation learning (IL) refers to an agent's acquisition of skills or behaviors by observing a teacher demonstrating a given task. With inspiration and basis stemming from neuroscience, imitation learning is an important part of machine intelligence and human-computer interaction, and it has, from an early point, been viewed as an integral part of the future of robotics [71, 32]. It is a recent trend in robotics, mainly due to the difficulty of traditional ML algorithms in scaling to high-dimensional agents with many degrees of freedom [41, 76, 45].

Humans have evolved to live in societies, and a major benefit of that is the ability to leverage the knowledge of parents, ancestors, or peers to aid their understanding of the world and more rapidly develop skills deemed crucial for survival [8]. Our species is, by far, the one that has best improved collective learning and the way knowledge can be passed to other generations by means other than DNA. It is safe to say that a good part of our learning is done by observing other agents.

As a Machine Learning problem, specifically related to RL, imitation is the problem of learning a control policy that mimics a behavior provided via demonstration. Unlike other areas of Machine Learning, like image and audio modeling, the observations are constrained by the environment and depend on the agent’s interactions [50].

IL couples well with RL because one can infer that direct imitation is not enough to reproduce robust, human-like behavior in intelligent agents. Most of the problems are due to generalization, along with the problem of demonstration, wherein particularly fine-precision tasks can be challenging to imitate. It is easy to conceive that a visual demonstration of a robot walking is not enough to teach another robot to walk adequately: it does not contain information about the contact of the teacher robot's feet with the ground and its center of mass location, among other information.

If an RL policy is learned by demonstration, RL can be applied to fine-tune its parameters, and it can use positive and negative examples to reduce its search space. It can enhance the policy if there are considerable discrepancies between the agent and the teacher and even alleviate errors in the acquisition of demonstrations [32]. It can be used to make the learning agent learn a policy that was not present in the demonstration distribution while avoiding the risk of falling into local minima.

IL is usually thought of as the problem of learning an expert policy that generalizes to unseen states, given many expert state-action demonstration trajectories [50]. One of the key challenges in IL is the correspondence problem [56, 13]. If the teacher's and the learner's bodies differ, it is not trivial to find a mapping of the teacher's demonstrations. Furthermore, different points of view (the learner observes in third person), differences in the anatomies, as well as different capabilities (dynamic properties), such as robotic joints that cannot achieve the same velocity or different strength in the agents' muscles, can lead to demonstrated behaviors that are not executable by the learner [13].

Imitation Learning (IL) can be divided into Behavior Cloning (BC) and Inverse Reinforcement Learning (IRL).

Behavior Cloning

In BC, the policy is generally learned in a supervised way, from the state-action tuples provided by an expert [64, 67]. It is a straightforward approach, with a smoothing of noisy demonstrations [13]. However, it requires large training datasets and also tends to be brittle, failing when the agent diverges considerably from the demonstration trajectories [89, 5].

It is a Supervised Learning (SL) problem in which we fit a model to a dataset of expert (trajectory, context) pairs. For example, neural networks can be trained with images of human-driven cars to derive a policy based on medium-level actions (like steering right or stopping the car). While this supervised learning approach is well-defined, a problem arises in this setup when there are too many state transitions not visited by the expert. Also, when the agent cannot make sense of its dynamics only by seeing trajectory data, without interacting with the environment, the system might fail.

BC methods map states (and sometimes contexts) directly to actions without recovering the reward function. BC is indicated when the SL approach is the most secure/parsimonious way to represent the desired behavior that guides the learning. It is a good approach to apply before the training phase, as a form of transfer learning, especially in end-to-end setups that use images/videos. It can also work as an indicator for hierarchical control, since it is easier to follow trajectories with medium-to-high-level controls, leaving the complicated low-level actions to already established controllers.

In this setup, an expert trajectory dataset must be available (or we assume that we can query the expert during training). The dataset generally is a set of (trajectory, context) pairs (in this work, the context is always the same). The dataset can also be, as in our case, a set of (state, action) pairs. There are several methods for solving this problem by choosing a cost function L that measures the similarity between the demonstrated behavior and the behavior performed by the learner's policy in each training epoch.
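As an illustration of this supervised view, the sketch below fits a small policy network to expert (state, action) pairs by mean-squared error. The network shape, dimensions, and data tensors are placeholders and do not describe the setup used in this thesis.

import torch
import torch.nn as nn

def behavior_cloning(expert_states, expert_actions, epochs=100, lr=1e-3):
    """expert_states: (N, state_dim); expert_actions: (N, action_dim) from the expert."""
    state_dim, action_dim = expert_states.shape[1], expert_actions.shape[1]
    policy = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, action_dim))
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        pred = policy(expert_states)                  # pi(s): predicted actions
        loss = ((pred - expert_actions) ** 2).mean()  # L: similarity to the expert's actions
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy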

Inverse Reinforcement Learning

In IRL, the reward function is inferred from the expert demonstrations [3, 93, 47, 79, 14], and RL methods are then used to optimize it. IRL is obviously suitable for tasks where hand-crafting reward functions is non-trivial, such as parking lot navigation [1] and helicopter acrobatics [2]. More modern approaches to IRL alternate between forward and inverse RL [28, 15], while in practice working directly on high-dimensional observations has been difficult [50].

IRL seeks a valid policy π and a reward function under which the expert's behavior is optimal. The task is formulated as a saddle-point optimization problem, which optimizes the agent π in the inner loop while searching for the reward function in the outer loop. This is the main reason why traditional IRL is too time-consuming and not recommended for large action-state spaces such as 3D robotics.

The goal of IRL is to recover the unknown reward function R(τ) from the expert's trajectories. However, since a policy can be optimal for multiple reward functions, the problem of determining the reward function is "ill-posed." To obtain a unique solution in IRL, many studies have proposed additional objectives to be optimized, such as the margin between the optimal policy and the others [57, 3, 63, 77], or the entropy, to be maximized [95, 94, 75].

2.3.1

GAIL - Generative Adversarial Imitation Learning

As stated previously, IRL algorithms are very computationally demanding, requiring RL to be solved in the inner loop, which makes them hard to scale to large environments. However, the authors of GAIL [28] propose a new general framework for directly extracting a policy from data, as if it had been derived by RL after IRL. Their formulation allows the use of IRL in complex and high-dimensional DRL setups.

A specific example of an IRL algorithm is maximum causal entropy IRL [95], which fits a cost function from a family of functions C with the optimization problem [61]:

\max_{c∈C} ( \min_{π∈Π} E_π[c(s, a)] − H(π) ) − E_{π_E}[c(s, a)]    (2.24)

where C is a class of cost functions, c(s, a) is a cost function c : S × A → R, π is the agent policy being trained (the actor), and H(π) is the γ-discounted causal entropy, H(π) ≜ E_π[−\log π(a|s)].

Maximum causal entropy IRL seeks a cost function c ∈ C that assigns low cost to the expert policy and high cost to other policies, thereby allowing the expert policy to be found via a certain reinforcement learning procedure [61]:

RL(c) = \arg\min_{π∈Π} −H(π) + E_π[c(s, a)]    (2.25)

which maps a cost function to the high-entropy policies that minimize the expected cumulative cost.

Ho and Ermon [28] came up with a solution and proposed practical algorithms for using IRL in high-dimensional setups. They formulated a dual problem to IRL, which finds a proper policy without explicitly finding a reward function [28]. Their imitation learning algorithm finds a policy close to the expert's in terms of occupancy measures p. The optimization problem is formulated as follows:

\min_π ψ(p_π − p_{π_E}) − H(π)    (2.26)

As mentioned previously, Inverse Reinforcement Learning (IRL) learns a cost function based on observed trajectories, and a policy is then trained using this cost function. Ho and Ermon [28] describe the occupancy measure as the distribution of state-action pairs generated by the policy π. Mathematically, the occupancy measure is defined as:

p_π(s, a) = π(a|s) \sum_{t=0}^{∞} γ^t P(s_t = s|π)    (2.27)

Occupancy measure (Eq. 2.27) can be understood as a discounted joint probability of state-action pair. Regularization function φ smoothly penalizes violations in the difference between the occupancy measures of agent and expert. Authors have shown that there exists a one-to-one correspondence between policy and occupancy measure, which means the closing agent’s occupancy measure is to experts; the more similar the policies are [61]. The authors in [28] say that changing the cost regularizer in the occupancy measure formulation causes to the recovery of different know algorithms in IRL. In particular, they propose a new method that uses JS (Jensen–Shannon) divergence, Generative Adversarial Imitation Learning (GAIL). A connection can be drawn between GAIL and GANs (Gen-erative Adversarial Networks) [19, 61, 18], which train a gen(Gen-erative model G by having it

A connection can be drawn between GAIL and Generative Adversarial Networks (GANs) [19, 61, 18], which train a generative model G by having it confuse a discriminative classifier D. This formulation can be used to minimize an actual metric, the JS divergence, between the two occupancy measures (Eq. 2.28).

$$\min_{\pi} \; \psi_{GA}(p_\pi - p_{\pi_E}) - \lambda H(\pi) = D_{JS}(p_\pi, p_{\pi_E}) - \lambda H(\pi) \qquad (2.28)$$

The optimization procedure is implemented as a generative adversarial network. The algorithm simulates a zero-sum game between a discriminator network D and a generative model network G. The generator produces data, and its job is to get as close as possible to the actual data distribution; the job of D is to distinguish between the distribution of data generated by G and the actual data distribution. When D can no longer distinguish generated data from real data, the generator has managed to learn the actual data distribution. The occupancy measure p_π is analogous to the data distribution of the generator G in the GAN formulation, and the expert's occupancy measure is analogous to the true data distribution. In our case, G generates p_π, and its job is to find a proper policy π whose occupancy measure closely resembles the expert's.

The goal of GAIL is to find the saddle point of Equation 2.29, which is the GAN objective [18] minus the policy entropy regulariser,

$$\min_{\pi} \max_{D} \; \mathbb{E}_{\pi}[\log D(s,a)] + \mathbb{E}_{\pi_E}[\log(1 - D(s,a))] - \lambda H(\pi) \qquad (2.29)$$

where π is the parameterized policy (the actor, or generator), π_E is the expert policy (or the expert trajectories it generates), D(s, a) is the discriminator network, H is the causal entropy acting as a policy regulariser, and λ > 0 is its weight.
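To ground this objective, the following is a minimal PyTorch sketch of a discriminator, the binary cross-entropy form of the inner maximization over D in Eq. 2.29, and the surrogate reward r_t = −log D(s_t, a_t) later fed to the policy optimizer. The network sizes and function names are illustrative assumptions, not the implementation used in this work.

import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """D(s, a): probability that a state-action pair comes from the agent (rather than the expert)."""
    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))  # raw logits

def discriminator_loss(disc, agent_obs, agent_act, expert_obs, expert_act):
    # Agent pairs are labeled 1 and expert pairs 0; minimizing this BCE is
    # equivalent to maximizing the inner objective over D in Eq. 2.29.
    bce = nn.BCEWithLogitsLoss()
    agent_logits = disc(agent_obs, agent_act)
    expert_logits = disc(expert_obs, expert_act)
    return bce(agent_logits, torch.ones_like(agent_logits)) + \
           bce(expert_logits, torch.zeros_like(expert_logits))

def gail_reward(disc, obs, act, eps=1e-8):
    # Surrogate reward r_t = -log D(s_t, a_t): high when the pair looks expert-like.
    with torch.no_grad():
        d = torch.sigmoid(disc(obs, act))
    return -torch.log(d + eps)

With this labeling convention, the policy is rewarded whenever the discriminator judges its state-action pairs to be expert-like, which drives the agent's occupancy measure toward the expert's.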

First, the weights of the discriminator network D are updated with an Adam gradient step based on the generated and expert trajectories. Second, a Trust Region Policy Optimization (TRPO) update step is performed on the policy weights θ, using the cost given by the discriminator with parameters φ. Ho and Ermon [28] chose TRPO because it constrains how much the policy can change at each update. We chose PPO instead, since it enforces the same principle through a clipped surrogate objective.

The training procedure for GAIL is described in Algorithm 3. In line 6, the weights of the discriminator network are updated. Line 7 updates the policy weights θ with a PPO step, using r_t = −log D(s_t, a_t) as the surrogate reward.

Algorithm 3: GAIL
1 Initialize expert trajectories τ_E ∼ π_E
2 Initialize policy network with parameters θ at random
3 Initialize discriminator network with parameters φ at random or by behavior cloning
4 for epoch ∈ {1, . . . , num_epochs} do
5     Sample expert trajectories from τ_E or generate them with π_E
6     Update the discriminator parameters φ with a gradient step on E_π[log D(s, a)] + E_{π_E}[log(1 − D(s, a))]
7     Run PPO in the inner loop to update θ, using r_t = −log D(s_t, a_t) as the reward
8 end
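The outer loop of Algorithm 3 can be sketched compactly as below, reusing the discriminator_loss and gail_reward utilities from the earlier sketch. The helpers collect_rollout, sample_expert_batch, and ppo_update are hypothetical names standing in for routines that a full implementation would provide; this is a sketch of the alternating updates, not the exact code used in this work.

import torch

def train_gail(policy, disc, env, expert_buffer, epochs=1000,
               rollout_len=2048, disc_lr=3e-4):
    # Alternate one discriminator step and one PPO step, replacing the
    # environment reward with r_t = -log D(s_t, a_t).
    disc_opt = torch.optim.Adam(disc.parameters(), lr=disc_lr)

    for epoch in range(epochs):
        # Collect on-policy trajectories with the current generator/policy (line 5).
        obs, act = collect_rollout(policy, env, rollout_len)              # hypothetical helper
        e_obs, e_act = sample_expert_batch(expert_buffer, rollout_len)    # hypothetical helper

        # Discriminator step (line 6 of Algorithm 3).
        disc_opt.zero_grad()
        loss_d = discriminator_loss(disc, obs, act, e_obs, e_act)
        loss_d.backward()
        disc_opt.step()

        # Policy step (line 7): PPO on the surrogate reward.
        rewards = gail_reward(disc, obs, act)
        ppo_update(policy, obs, act, rewards)                             # hypothetical helper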

GAIL shows significant performance gains over other IL methods and can imitate complex behavior in large, high-dimensional environments. It is therefore a strong candidate for mimicking complex low-level controllers for quadrotors.

In the next chapter, we discuss works focusing on IL and on the use of DRL to control UAVs, which will help us understand the issues and setups used to solve our target problem. They will also give us ideas for defining our reward function or even our state-space representation.

Chapter 3

Related Work

In this chapter, we highlight the works that are strictly related to our research goals, especially those that employ some end-to-end imitation learning approach or apply DRL to UAV control.

In Section 3.1 we present some works that perform Imitation Learning employing Deep Reinforcement Learning. Section 3.2 discusses research that used DRL to control UAVs, either for low-level control or for high-level decisions on top of pre-programmed traditional controllers.

There are not many scientific papers using DRL to control drones, because optimal control theory already performs well in this task; however, this is changing rapidly as more applications that rely on images arise. Image-Based Visual Servoing (IBVS) [70] is an example of such a shift in the scenario.

3.1 Advances in Imitation Learning

Borsa et al. [8] apply DRL and IL to a simple 2D go-to-target task with different levels of difficulty (for example, the number of rooms), proposing to treat the Imitation Learning problem by adding the observation state of the master agent to the observation state of the learning agent. However, this does not represent a traditional form of Imitation Learning, since the learner agent does not receive the state-action tuples.

This work used a two-dimensional grid world and curriculum training, where the task difficulty was progressively increased and the results compared across configurations. The different configurations were: a learner (learning agent) trying to figure out the best policy alone in the world, a learner with access to the master's ground-truth position (anywhere on the map), and a learner with only a spatial view of the master agent (losing contact when occlusion occurred). They compared the impact of curriculum learning and transfer learning on the performance of the learner agent.

Figure 3.1 depicts snapshots of the environments of the simple navigation tasks evaluated in [8], where there are a learning agent and a teaching agent, and the corners of the rooms are the possible goal locations.


Figure 3.1: Snapshots of the different environments. Extracted from: [8].

Despite the simplicity of the domain employed, the study was promising. It confirmed that, instead of learning with the master agent how the world works and how the goal can be achieved, the learner only learns to mimic the master's behavior. Not only is this an obvious drawback of the solution, it can also be detrimental in a myriad of tasks, since the study showed that the learner gets lost in the absence of the teacher (which could make learning even more time-consuming than learning alone). For example, if in a subsequent and harder level there is some location where the learner (subject to transfer learning) consistently loses sight of the master, it could get stuck, whereas an agent that learns from scratch would keep exploring the environment and eventually find its way to the optimal policy. The authors proposed a solution to address this problem: to mask (make invisible) the teacher agent with some probability P that keeps increasing during training, leading to the complete absence of the master in the final learning epochs. Although they claimed that the problem was tackled, they did not present any results, which could indicate that the solution did not deliver remarkable gains.

In another work, Zaremba et al. [12] use meta-learning and imitation learning to solve a block-stacking task with a Fetch robot. While their goal is to generalize quickly to unseen tasks, they use regular imitation learning in the sense that they learn from expert demonstrations.

They state that IL has commonly been applied to solve different tasks in isolation, which demands expert feature engineering and/or a large amount of data. This goes against our goals, since one could argue that a user would want (and expect) robots to learn from few demonstrations, significantly surpassing human beings, and to generalize quickly to similar situations. The authors propose to apply meta-learning, where the algorithm itself learns to generalize to a different set of requirements and data.

Figure 3.2 presents an example of the block stacking task addressed in this work, which aims to control a Fetch robotic arm to stack blocks into various layouts.


Figure 3.2: Block stacking task. Extracted from: [12].

In this work, the objective is to maximize the expected performance of the learned policy when faced with a previously unseen task, having received only one demonstration of that task as input. The learned policy takes as input: (i) the current observation and (ii) one successful demonstration that solves a different instance of the same task (this demonstration is fixed for the duration of the episode). For example, suppose several blocks are scattered on a table, each labeled with a letter that the robot can recognize visually. One task instance could then be to build four towers of two blocks (ab cd ef gh), while another could be to build two towers of four blocks (abcd efgh).

To process the sequences of states and actions corresponding to the demonstration, as well as the vector specifying the locations of the elements in the environment, soft attention [6] was used, which made the generalization the researchers were looking for possible. As far as the learning algorithm is concerned, any method for policy learning could be applied. The researchers opted for imitation learning approaches, such as DAGGER [67], because they considered that providing only demonstrations, and not reward functions, would scale better to a wide variety of tasks. One of the main contributions claimed for the work is the proposed architecture, consisting of three modules: the demonstration network, the context network, and the manipulation network. The study also introduces a novel technique named neighborhood attention operation.


Among the configurations evaluated, the one that achieved better generalization used only the final state of the demonstration, where the agent did not receive information about intermediate steps. As one could expect, however, this approach lacks good performance on more complicated tasks since, for example, the agent can find itself without knowing which block to stack first. Finally, this work distinguished three different forms of failed tasks: wrong move (where the block configuration was wrong), manipulation failure (an error of "hand" manipulation from which the policy did not know how to recover), and recoverable failure (where the policy ran out of time). The authors found that the majority of the failures came from manipulation, showing that a simple NN was not enough to handle the difficult problem of dexterous manipulation. They were beginning to explore this model and intend to extend these experiments to a broader range of tasks and scenes.

Nair et al. [55] presented a system where the robot can manipulate a rope into target configurations by combining a high-level plan provided by a human with a learned low-level model of rope manipulation.

They propose a learning approach that associates the behavior of a deformable object (a rope) with the actions performed by a robot, using supervised learning on a large dataset gathered autonomously by the robot. In this case, the robot learns a goal-driven inverse dynamics model: given a current state and a goal state, both in image space, the model predicts the action that will lead toward the completion of the goal.
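To illustrate the general idea of a goal-conditioned inverse dynamics model in image space (this is a generic sketch, not the architecture used in [55]; the layer sizes and names are assumptions), one could write:

import torch
import torch.nn as nn

class InverseDynamicsModel(nn.Module):
    """Illustrative goal-conditioned inverse model: given the current image and a goal
    image, regress the action (e.g., pick point and displacement) that should move the
    system from the former toward the latter."""
    def __init__(self, act_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(            # shared CNN feature extractor
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(2 * 64, 128), nn.ReLU(),
            nn.Linear(128, act_dim),             # regressed action parameters
        )

    def forward(self, current_img, goal_img):
        z = torch.cat([self.encoder(current_img), self.encoder(goal_img)], dim=-1)
        return self.head(z)

Trained with supervised regression on (current image, next image, action) triples, such a model can then be chained along a sequence of human-provided goal images to execute a high-level plan.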

Figure 3.3 explains the learning system, in which the robot manipulates a rope into target configurations by combining a high-level plan provided by a human with a learned low-level model of rope manipulation. The robot uses a Convolutional Neural Network (CNN) to learn the inverse model by watching demonstrations of rope manipulation made by a human.

Figure 3.3: Learning system and task environment. Extracted from: [55].

During the experiments, the agent was left alone with a rope, free to manipulate it at will and learning with self-supervision. Since the rope is very likely to fall off the table or drift out of reach, a heuristic picked it off the ground and put it back on the table when necessary. To minimize human intervention and achieve continuous data collection, the robot resets its operation after 50 actions or if fewer than 1000 pixels of rope are found in the image. The robot uses the point cloud from a Kinect camera to segment the rope; it then chooses a random point on the rope as the pick point, and the drop point is set as a displacement vector from this pick point.

The researchers claim that, after the robot has learned the model, their algorithm can use a human-provided demonstration as a high-level guide. Hence, their model tells how to execute (the low-level dynamics of) the actions indicated by the human instructor.

The model used in this paper does not rely on any parametrization of the rope. It only uses raw images, which avoids the hard task of specifying kinematic/dynamic models that may not map well to a deformable object. The feature extractor of the end-to-end model is a deep Convolutional Neural Network (CNN).

Although this paper is very interesting, for more complex tasks such as rope knotting the model was successful only 38% of the time. In such cases, we hypothesize that it is necessary to evaluate the results not only against model-free benchmarks but also against hardcoded methods, to assess whether the contribution to the general field of robotics is indeed valid in a practical way.

Wang et al. [89] used Imitation Learning for a reaching task with a 9-DoF robotic arm and for motor control in the Walker and Humanoid environments of the OpenAI Gym framework [9]. The authors proposed an approach that combines the advantages of two techniques: supervised learning through Variational Auto-Encoders (VAE) [40], which can achieve one-shot imitation learning (with proper training) but suffers when the trajectory diverges considerably from the demonstrations available at training time, and GAIL (Generative Adversarial Imitation Learning) [28], which can learn robust controllers from fewer demonstrations but is inherently mode-seeking and challenging to train. Their model applies a type of autoencoder to the demonstrated trajectories to learn semantic policy embeddings. The embeddings can thus be learned on a reaching task and then be smoothly interpolated to produce new reaching behaviors.

In Figure 3.4, the upper row shows the demonstrations in the training and test sets, in the humanoid domain simulated in the MuJoCo software [85], and the bottom row shows the imitation achieved by the learning agent.
