• Nenhum resultado encontrado

Reinforcement Learning for Robust Missile Autopilot Design

N/A
N/A
Protected

Academic year: 2023

Share "Reinforcement Learning for Robust Missile Autopilot Design"

Copied!
102
0
0

Texto

O desenvolvimento de controladores de piloto automático de mísseis foi uma tarefa complexa, dado o envelope de voo estendido e sua dinâmica não linear. Ainda não foi encontrada uma solução que possa sobressair tanto no desempenho nominal como na robustez face às incertezas.

Nomenclature

Acronyms

Introduction

  • Motivation
  • Topic Overview
  • Objectives
  • Thesis Outline

Some are basic RL concepts (cf. Section 3.1) and others are state-of-the-art RL algorithms and complementary features. After all this, the thesis will conclude with a brief summary of the most important achievements (see section 6.1) and some suggestions of directions for future research (see section 6.2).

Model

Modelled Dynamics

  • Forces and Moments
  • Flight Dynamics
  • Longitudinal Approximation
  • Reference Model

The dynamics of the rotation is characterized by equation (2.8), in which the dynamics of the body's velocity relative to the E-frameωE. Instead, the system will be required to follow shape reference signals, i.e., the output of the Reference Model.

Problem Formulation

The command reference signal (cf. section 4.3.1) is given as an input to the Reference Model, whose output, the formed reference signal, is the input of the agent. The former was defined as the L-2 norm of the four resting periods of the episode (cf.

Figure 2.1: Shaped and command reference signals
Figure 2.1: Shaped and command reference signals

Background

  • Reinforcement Learning - Basic Concepts
  • Trust Region Policy Optimization
  • Reward Engineering
  • Hindsight Experience Replay
  • Prioritized Experience Replay

One way to branch all possible RL algorithms is whether or not the agent has access to the dynamics of the environment. Basically, the critic is an estimator of the rewards that the agent will collect when it performs a certain action. Sampling efficiency is a qualitative measure of the amount of observations an algorithm must collect from the environment to achieve a given goal.

Aˆπ,γ can be defined as the sum of the first elements1 of the sequence of states and actions that occurred (cf. Along, perhaps, with the exploration strategy, the reward function is one of the main features of the success of an RL algorithm. Schaul's hypothesis et al. is that significance should be defined in terms of Time Difference (TD) error (cf.

For high-reward agents (see section 3.3), the effects of this rarity are limited to an increase in convergence time.

Application of Reinforcement Learning to the Missile’s Flight

Reward Engineering

  • Dense Exponential
  • Dense Sum
  • Sparse
  • Dense Sum with SER

Moreover, a tendency was also observed for the performance objectives (see section 2.2) to be interrelated. Given a random initialization, the agent's first interactions with the environment typically result in a performance far from the target performance (see Figure 4.5), which iteratively moves closer to the goal. Despite the advantages mentioned in section 4.1.1, the exponential profile adds an additional level of complexity regarding the tuning process of the relative weights (wi, see Appendix A).

Finally, after reaching a suboptimal policy marked by an oscillating behavior of η in rest periods (cf. figure 4.1), f4 (cf. In fact, the problems associated with sparse CORF (cf. section 3.3) hold, also, for dense However, contrary to sparse rewards (see section 4.1.3), dense rewards are, alone, quite efficient in bringing the agent to the middle/near region and, therefore, the SER (see section 4.2.5) was implemented together with some CORFs from section 4.1.2 that had shown good train results, possibly with minor changes.

Considering the implementation of Section 4.1.2, configuration 1 modifies w3 and w4 (cf. Appendix A), decreasing them when |ez| grows so that the agent can be more agile in following the rise of the reference signal.

Figure 4.1: Suboptimal Agent with oscillatory η in resting periods
Figure 4.1: Suboptimal Agent with oscillatory η in resting periods

Algorithm

  • Hyperparameters’ Configuration
  • Modifications to original TRPO
  • Hindsight Experience Replay
  • Balanced Prioritized Experience Replay
  • Scheduled Experience Replay
  • Algorithm Description

The exploration strategy is the key to the sampling of new actions, included in step 1 of Algorithm 2 (cf. Section 4.2). In both strategies, apart from the step amplitudes, all the other original parameters (cf. section 2.1.4) of the reference signal are preserved. Having HER (cf. Section 4.2.3) and BPER (cf. Section 4.2.4) dependent on a condition - the SER condition - is hereby defined as Scheduled Experience Replay (SER).

The reward function that led to the target performance (cf. section 5.2) is nevertheless not sparse and could already reach the near region without HER. The actions are sampled according to the policy and the proposed exploration strategy (cf. section 4.2.2). With the above configuration, after it has been collected, the episode is over and therefore, if the SER condition (cf. section 4.2.5) holds, it is replenished according to HER (cf. section 4.2.3).

If the SER condition (cf. section 4.2.5) holds, the training dataset is sampled according to BPER (cf. section 4.2.4).

Table 4.2: Hidden Layers of the Neural Networks a
Table 4.2: Hidden Layers of the Neural Networks a

Methodology

  • Command Reference Signal Generation
  • Intermediate Tests

The criteria in table 4.6 have been established as a softer version of the requirements that define the target performance (cf. section 2.2). The definition of the criteria used to stop the train is the key to the success of the implementation, since, as chapter 5 shows, longer trains do not always mean a performance closer to the target. Furthermore, the need for table 4.6 to present higher thresholds than those of section 2.2 lies in the fact that the proposed exploration strategy (cf. section 4.2.2) never allows η's variance to be zero and therefore the test performance can be better than the one observed during training.

During training, Algorithm 2 is run with periodic interruptions where it is put on hold and an intermediate test is implemented to keep track of its progress. In all cases, after any test episode where, of course, there is no investigation (cf. Section 3.1), all the criteria defined in Section 2.2 are checked and the train terminated successfully if met and resumed otherwise.

Figure 4.7: Command Reference Signal
Figure 4.7: Command Reference Signal

Reproducibility Assessments

To highlight this issue, another nine training sessions were conducted with the same configuration, the results of which are discussed in Section 5.3, making a total of ten agents trained under the same conditions, with the following difference: when the best agent is trained found (cf. Section 5.2), both neural networks were randomly initialized, while the remaining nine agents were run loading the same initialization. Such a decision is made to consider only the effects of the factors that make up the design choice of an algorithm and, therefore, may be the subject of future research (see section 6.2). Furthermore, it has been observed that an agent whose training starts by restarting the training progress of another agent faces some instability in the first few episodes, which may cause its policy to deviate so much from that of the loading agent, that the loading benefits are void.

For this reason, in the reproducibility experiments, the configuration described in Section 4.2.1 was fully preserved, except for the value of KL T ARGET, which was set to 0.0001, which reduced the radius of the confidence region by a factor of 30, in a more cautious implementation experiment training.

Robustness Assessments

  • Robustifying Trains
  • Robustified Agent

In each case, four different values ​​were tried for the limits of the range of possibilities (either maxor3σ), discarding the values ​​for which the robust train had disintegrated after 2500 episodes. Introducing latency into the system means slowing down the effect of the action in the environment. This non-nominal environment therefore differs from the nominal environment in the presence of delay elements at the agent's output.

In this case, the M andh are estimated, the dynamics of which are significantly slower than those of the measured quantities under the considered flight conditions. The parameterization of the model's coefficients based on labeled data is not flawless and therefore it is important to know the agent's behavior in case of uncertainty. Aerodynamic coefficients, such as Cz and Cm, are of utmost importance because they are the basis for the dynamics of the longitudinal approach of the GSAM.

5To be precise, this interval only covers 99.73% of the possible values, but it is assumed to cover the entire spectrum of possible values.

Results and Discussion

  • Expected Results
  • Best Found Agent
    • Training Evolution
    • Test Performance
  • Reproducibility Assessments
  • Robustness Assessments
    • Latency
    • Estimation Uncertainty
    • Parametric Uncertainty

Since after this peak the magnitude η and the noise levels return to their previous stabilized (good) state and that there is a significant reduction in the tracking error between idle periods (cf. Fig. 5.4), it is reasonable to say that the adopted exploration strategy is properly oriented, which not enough to produce the desired policy divergence. As Figure 5.7 demonstrates, the agent is clearly able to control the measured acceleration and track the reference signal fed to it, meeting all the performance requirements specified in Section 2.2 (see Table 5.1). As Figure 5.9 shows, of the four different lmaxtried values, only 5 ms converged after 2500 episodes, which means that this was the only robust set that ran for a total of 5000 episodes, during which the best agent found was identified as with delay Robustified.

Compared to the nominal agent (see Figure B.1), the nominal performance of the latter (see Figure 5.10) is damaged, with a less stable action signal and poorer tracking performance. As Figure 5.11 shows, of the four different values ​​of 3σ, only the 3% converged after 2500 episodes, meaning that it was the only reinforcing train that ran for a total of 5000 episodes, with the best agent found defined as the estimate. Uncertainty Robust agent. As Figure 5.13 shows, of the four different values ​​of 3σ, only the 5% value converged after 2500 episodes, meaning that it was the only reinforcing train that ran for a total of 5000 episodes.

When compared to the nominal mean (cf. figure B.1), the latter's nominal detection performance is damaged, but the action signal is smoother (cf. figure 5.14).

Figure 5.1: η’s evolution during best found agent’s training
Figure 5.1: η’s evolution during best found agent’s training

Conclusions

Achievements

Future Work

In fact, the reward technique (see 1) is the most time-consuming phase of the entire process and avoiding it is a priority. To this end, approaches such as having sparse reward functions [28] or no reward functions at all [35] can be a source of inspiration. As section 4.1.3 showed, it is very likely that such a drastic simplification of the reward function will require deeper adjustments in the RL algorithm itself, even to the point where the algorithm to be implemented – in this case TRPO – is chosen due to the reward function and not the opposite.

However, since we know that DQN is not suitable for this task due to its discrete action space, it would be interesting to implement DDPG [6], DQN's continuous action space subalgorithm, with sparse reward and HER. This is especially important if the objectives become more complex when the implementation is extended to the full flight dynamics model as suggested earlier. As for 3, research could be done to find different techniques for the optimization phase of the neural network instead of ADAM, as they certainly have a strong impact on how the training of the parameters dictates the reproducibility of the results.

Finally, there is also room for improvement regarding the hyperparameterization (see 4), more specifically, to identify the optimality domains of each hyperparameter and the advantages of search algorithms such as Random Search [37] in the matching process.

Bibliography

Soft actor-critic: Out-of-policy maximum entropy deep reinforcement learning with a stochastic actor.35th International Conference on Machine Learning, ICML. Interpolated policy gradient: merging on-policy and off-policy gradient estimation for deep reinforcement learning.

Appendix A

Reward Engineering

Appendix B

Nominal Agent used in the Robustness Assessments

Appendix C

Reproducibility Assessment Experiments

Appendix D

Nominal and Robustified Agents’

Robustness Comparison

Imagem

Figure 2.1: Shaped and command reference signals
Figure 4.1: Suboptimal Agent with oscillatory η in resting periods
Figure 4.2: Impact of the speed in the sigmoid function
Figure 4.3: Sigmoid-like decay of σ 2 log,tune
+7

Referências

Documentos relacionados

Conclui-se que n˜ao ´e interessante utilizar a estrat´egia que toma somente o timeout para identificar sess˜oes de usu´arios no ambiente estudado porque as regras de associac¸˜ao