We study the robustness of reinforcement learning (RL) under adversarially perturbed state observations, a setting that aligns with many adversarial attacks on deep reinforcement learning (DRL) and is also important for deploying RL agents in the real world under unpredictable sensing noise. With a fixed agent policy, we show that an optimal adversary for perturbing state observations can be found, and it is guaranteed to obtain the worst-case agent reward. For DRL settings, this leads to a novel empirical adversarial attack on RL agents via a learned adversary that is much stronger than previous ones. To enhance the robustness of an agent, we propose a framework of alternating training with learned adversaries (ATLA), which trains an adversary online together with the agent using policy gradient, following the optimal adversarial attack framework. Additionally, inspired by the analysis of the state-adversarial Markov decision process (SA-MDP), we show that past states and actions (history) can be useful for learning a robust agent, and we empirically find that an LSTM-based policy can be more robust under adversaries. Empirical evaluation on several continuous control environments shows that ATLA achieves state-of-the-art performance under strong adversaries. Our code is available at https://github.com/huanzhang12/ATLA_robust_RL.
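To make the alternating scheme concrete, below is a minimal, self-contained sketch of an ATLA-style training loop. Everything here is an assumption for illustration: `ToyEnv`, `perturb`, the linear policies, and the random-search update are hypothetical stand-ins (the paper trains both agent and adversary with policy-gradient methods on continuous control benchmarks); only the alternating structure, where the agent maximizes its return under the current adversary and the adversary minimizes the agent's return, reflects the framework described above.

```python
import numpy as np

class ToyEnv:
    """Hypothetical stand-in environment: reward is higher when the action
    matches the (hidden) true state, so observation perturbations hurt."""
    def __init__(self, dim=4, horizon=50):
        self.dim, self.horizon = dim, horizon
    def reset(self):
        self.t = 0
        self.state = np.random.randn(self.dim)
        return self.state.copy()
    def step(self, action):
        reward = -np.sum((action - self.state) ** 2)
        self.state = 0.9 * self.state + 0.1 * np.random.randn(self.dim)
        self.t += 1
        return self.state.copy(), reward, self.t >= self.horizon, {}

def perturb(obs, adv_params, eps):
    # Adversary maps the true observation to a bounded perturbation
    # (L_inf-style ball of radius eps), mirroring the SA-MDP threat model.
    return obs + eps * np.tanh(adv_params @ obs)

def episode_return(env, agent_params, adv_params, eps):
    # The agent only ever sees perturbed observations during the rollout.
    obs, total, done = env.reset(), 0.0, False
    while not done:
        action = agent_params @ perturb(obs, adv_params, eps)  # toy linear policy
        obs, reward, done, _ = env.step(action)
        total += reward
    return total

def atla(eps=0.2, iters=200, lr=0.05, seed=0):
    """Alternate updates: agent ascends its return under the current adversary;
    adversary descends the agent's return. A noisy random-search step is used
    here as a placeholder for the paper's policy-gradient (PPO-style) updates."""
    rng = np.random.default_rng(seed)
    env = ToyEnv()
    agent = np.eye(env.dim)
    adv = np.zeros((env.dim, env.dim))
    for _ in range(iters):
        # Phase 1: adversary fixed, improve the agent (maximize return).
        cand = agent + lr * rng.standard_normal(agent.shape)
        if episode_return(env, cand, adv, eps) > episode_return(env, agent, adv, eps):
            agent = cand
        # Phase 2: agent fixed, improve the adversary (minimize agent return).
        cand = adv + lr * rng.standard_normal(adv.shape)
        if episode_return(env, agent, cand, eps) < episode_return(env, agent, adv, eps):
            adv = cand
    return agent, adv

if __name__ == "__main__":
    robust_agent, learned_adv = atla()
```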