A deep reinforcement learning (DRL) agent observes its states through observations, which may contain natural measurement errors or adversarial noise. Since the observations deviate from the true states, they can mislead the agent into taking suboptimal actions. Several works have demonstrated this vulnerability via adversarial attacks, but existing approaches for improving the robustness of DRL under this setting have had limited success and lack theoretical principles. We show that naively applying existing techniques for improving robustness in classification tasks, such as adversarial training, is ineffective for many RL tasks. We propose the state-adversarial Markov decision process (SA-MDP) to study the fundamental properties of this problem, and develop a theoretically principled policy regularization that can be applied to a large family of DRL algorithms, including proximal policy optimization (PPO), deep deterministic policy gradient (DDPG), and deep Q networks (DQN), for both discrete and continuous action control problems. We significantly improve the robustness of PPO, DDPG, and DQN agents under a suite of strong white-box adversarial attacks, including new attacks of our own. Additionally, we find that a robust policy noticeably improves DRL performance even without an adversary in a number of environments. Our code is available at https://github.com/chenhongge/StateAdvDRL.
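To make the regularization idea concrete, the following is a minimal PyTorch sketch of a state-adversarial policy regularizer for a discrete-action policy, not the released implementation at the repository above: it approximately maximizes the KL divergence between the policy at a state and at a perturbed state within an l-infinity ball using a few projected-gradient steps, and returns that KL as a term to be added to the usual RL loss. The function name and the hyperparameters `eps`, `steps`, and `step_size` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def state_adversarial_kl(policy_net, states, eps=0.05, steps=3, step_size=0.02):
    """Approximate max over ||d||_inf <= eps of KL(pi(.|s) || pi(.|s+d))."""
    with torch.no_grad():
        clean_log_probs = F.log_softmax(policy_net(states), dim=-1)

    # Start from a random point inside the l_inf ball around the true states.
    delta = (torch.rand_like(states) * 2 - 1) * eps

    for _ in range(steps):
        delta = delta.detach().requires_grad_(True)
        pert_log_probs = F.log_softmax(policy_net(states + delta), dim=-1)
        # KL(pi(.|s) || pi(.|s + delta)), averaged over the batch.
        kl = F.kl_div(pert_log_probs, clean_log_probs,
                      log_target=True, reduction="batchmean")
        grad, = torch.autograd.grad(kl, delta)
        # Ascend on the KL and project back into the l_inf ball.
        delta = (delta + step_size * grad.sign()).clamp(-eps, eps)

    # Recompute the KL at the (approximately) worst-case perturbation;
    # gradients of this term now flow into the policy parameters.
    pert_log_probs = F.log_softmax(policy_net(states + delta.detach()), dim=-1)
    return F.kl_div(pert_log_probs, clean_log_probs,
                    log_target=True, reduction="batchmean")
```

In training, this value would be weighted by a coefficient and added to the standard policy or Q-learning loss. The gradient-ascent inner solver shown here is only one possible approximation of the inner maximization; the released code is the reference implementation.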