The automatic synthesis of a policy through reinforcement learning (RL) from a given set of formal requirements depends on the construction of a reward signal and consists of the iterative application of many policy-improvement steps. The synthesis algorithm has to balance target, safety, and comfort requirements in a single objective and guarantee that policy improvement does not increase the number of safety-requirement violations, especially in safety-critical applications. In this work, we present a solution to the synthesis problem by addressing its two main challenges: reward shaping from a set of formal requirements and safe policy updates. For the former, we propose an automatic reward-shaping procedure that defines a scalar reward signal compliant with the task specification. For the latter, we introduce an algorithm ensuring that the policy is improved in a safe fashion, with high-confidence guarantees. We also discuss the adoption of a model-based RL algorithm to use the collected data efficiently and train a model-free agent on predicted trajectories, where safety violations do not have the same impact as in the real world. Finally, we demonstrate on standard control benchmarks that the resulting learning procedure is effective and robust even under heavy perturbations of the hyperparameters.
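To make the reward-shaping idea concrete, the following is a minimal sketch of how a scalar reward could be assembled from target, safety, and comfort requirements. It assumes each requirement is available as a real-valued robustness function (positive when satisfied); the function name `shaped_reward`, the weights, and the min/mean aggregation are illustrative assumptions, not the procedure proposed in the paper.

```python
import numpy as np

def shaped_reward(state, action, requirements, weights=(1.0, 10.0, 0.1)):
    """Combine per-requirement satisfaction scores into one scalar reward.

    `requirements` maps each class ("target", "safety", "comfort") to a
    non-empty list of callables returning a robustness score in [-1, 1],
    where a positive value means the requirement is satisfied.
    """
    w_target, w_safety, w_comfort = weights
    target = np.mean([r(state, action) for r in requirements["target"]])
    safety = np.min([r(state, action) for r in requirements["safety"]])
    comfort = np.mean([r(state, action) for r in requirements["comfort"]])
    # Safety dominates: a violated safety requirement (negative score)
    # pulls the reward down more strongly than any target or comfort gain.
    return w_safety * min(safety, 0.0) + w_target * target + w_comfort * comfort
```

In this sketch the safety term only penalizes (it is clipped at zero), so satisfying safety never substitutes for progress on the target requirements; this is one plausible design choice, not the one prescribed by the paper.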