Reinforcement Learning for Robust Missile Autopilot Design

Designing autopilot controllers for missiles is a complex task, given the extensive flight envelope and the nonlinear flight dynamics. A solution that excels both in nominal performance and in robustness to uncertainties has yet to be found. While Control Theory often resorts to parameter-scheduling procedures, Reinforcement Learning has produced interesting results in increasingly complex tasks, ranging from video games to robotic tasks with continuous action domains. However, it still lacks clear insights on how to find adequate reward functions and exploration strategies. To the best of our knowledge, this work is a pioneer in proposing Reinforcement Learning as a framework for flight control. It aims to train a model-free agent that can control the longitudinal flight of a missile, achieving optimal performance and robustness to uncertainties. To that end, under the Trust Region Policy Optimization (TRPO) methodology, the collected experience is augmented according to Hindsight Experience Replay (HER), stored in a replay buffer, and sampled according to its significance. This work not only enhances the concept of prioritized experience replay into BPER, but also reformulates HER, activating both only when training converges to suboptimal policies, in what is proposed as the SER methodology. In addition, the Reward Engineering process is carefully detailed. The results show that it is possible both to achieve optimal performance and to improve the agent's robustness to uncertainties (with little degradation of nominal performance) by further training it in non-nominal environments, thereby validating the proposed approach and encouraging future research in this field.
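
To make the experience pipeline mentioned above more concrete, the following is a minimal sketch of hindsight relabeling followed by significance-weighted sampling. It is not the BPER or SER formulation of this work, which the abstract does not specify; it assumes the standard "final" HER strategy and plain proportional prioritization. The scalar state, the sparse reward with tolerance `tol`, and the names `Transition`, `reward_fn`, `her_relabel`, and `PrioritizedBuffer` are all hypothetical choices made for illustration.

```python
import random
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Transition:
    state: float       # hypothetical scalar state (e.g. a tracked output)
    action: float
    next_state: float
    goal: float        # reference value the agent was asked to reach
    reward: float


def reward_fn(next_state: float, goal: float, tol: float = 0.05) -> float:
    """Assumed sparse reward: 0 within tolerance of the goal, -1 otherwise."""
    return 0.0 if abs(next_state - goal) <= tol else -1.0


def her_relabel(episode: List[Transition]) -> List[Transition]:
    """'Final' HER strategy: treat the last achieved state as if it were the goal."""
    achieved = episode[-1].next_state
    return [
        replace(t, goal=achieved, reward=reward_fn(t.next_state, achieved))
        for t in episode
    ]


class PrioritizedBuffer:
    """Replay buffer sampling proportionally to (|significance| + eps) ** alpha."""

    def __init__(self, alpha: float = 0.6, eps: float = 1e-3) -> None:
        self.items: List[Transition] = []
        self.priorities: List[float] = []
        self.alpha = alpha
        self.eps = eps

    def add(self, transition: Transition, significance: float) -> None:
        self.items.append(transition)
        self.priorities.append((abs(significance) + self.eps) ** self.alpha)

    def sample(self, k: int, rng: random.Random) -> List[Transition]:
        # Sampling with replacement, weighted by the stored priorities.
        return rng.choices(self.items, weights=self.priorities, k=k)


# A failed episode: the goal 1.0 is never reached, so every reward is -1.
states = [0.0, 0.2, 0.4, 0.5]
episode = [
    Transition(s, 0.1, s_next, goal=1.0, reward=reward_fn(s_next, 1.0))
    for s, s_next in zip(states[:-1], states[1:])
]
relabeled = her_relabel(episode)

buffer = PrioritizedBuffer()
for t in episode + relabeled:
    # Placeholder significance; a TD error or advantage estimate is a common choice.
    buffer.add(t, significance=1.0 + t.reward)
batch = buffer.sample(4, random.Random(0))
```

After relabeling, the last transition of the failed episode carries a reward of 0 with respect to the substituted goal, so the buffer contains at least one successful example to learn from. The significance passed to `add` is a placeholder; how significance is actually measured is a design decision the abstract leaves open.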
