Reinforcement learning (RL) has achieved enormous progress in solving various sequential decision-making problems, such as control tasks in robotics. Because policies tend to overfit to their training environments, RL methods often fail to generalize to safety-critical test scenarios. Robust adversarial RL (RARL) was previously proposed to train an adversarial network that applies disturbances to the system, thereby improving robustness in test scenarios. However, an issue with neural-network-based adversaries is that it is difficult to integrate system requirements without handcrafting sophisticated reward signals. Safety falsification methods allow one to find a set of initial conditions and an input sequence such that the system violates a given property formulated in temporal logic. In this paper, we propose falsification-based RARL (FRARL), the first generic framework for integrating temporal-logic falsification into adversarial learning to improve policy robustness. With our falsification method, no extra reward function needs to be constructed for the adversary. We evaluate our approach on a braking assistance system and an adaptive cruise control system for autonomous vehicles. Our experimental results demonstrate that policies trained with a falsification-based adversary generalize better and violate the safety specification less often in test scenarios than policies trained without an adversary or with an adversarial network.