The study of provable adversarial robustness for deep neural networks (DNNs) has mainly focused on static supervised learning tasks such as image classification. However, DNNs are also used extensively in real-world adaptive tasks such as reinforcement learning (RL), making such systems vulnerable to adversarial attacks as well. Prior work on provable robustness in RL seeks to certify the behaviour of the victim policy at every time-step against a non-adaptive adversary, using methods developed for the static setting. But in the real world, an RL adversary can infer the defense strategy used by the victim agent by observing the states, actions, etc., from previous time-steps and adapt itself to produce stronger attacks in future steps. We present an efficient procedure, designed specifically to defend against an adaptive RL adversary, that can directly certify the total reward without requiring the policy to be robust at each time-step. Our main theoretical contribution is to prove an adaptive version of the Neyman-Pearson Lemma -- a key lemma for smoothing-based certificates -- where the adversarial perturbation at a particular time can be a stochastic function of current and previous observations and states as well as previous actions. Building on this result, we propose policy smoothing, where the agent adds Gaussian noise to its observation at each time-step before passing it through the policy function. Our robustness certificates guarantee that the total reward obtained under policy smoothing remains above a certain threshold, even though the actions at intermediate time-steps may change under the attack. Our experiments on environments such as Cartpole, Pong, Freeway, and Mountain Car show that our method can yield meaningful robustness guarantees in practice.
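To make the policy-smoothing idea concrete, the following is a minimal sketch (not the authors' implementation) of a single smoothed rollout: Gaussian noise with standard deviation `sigma` is added to each observation before it is passed to a pre-trained policy, and the total reward of the episode is recorded. The `env` and `policy` objects are assumed placeholders, with a Gymnasium-style environment API.

```python
import numpy as np

def smoothed_rollout(env, policy, sigma=0.2, max_steps=1000, seed=0):
    """Run one episode with policy smoothing.

    Gaussian noise (std = sigma) is added to every observation before it is
    fed to the policy; the return value is the episode's total reward, which
    is the quantity the paper's certificates bound from below.
    """
    rng = np.random.default_rng(seed)
    obs, _ = env.reset(seed=seed)          # Gymnasium-style reset (assumption)
    total_reward = 0.0
    for _ in range(max_steps):
        noisy_obs = obs + rng.normal(0.0, sigma, size=np.shape(obs))
        action = policy(noisy_obs)          # policy acts on the smoothed observation
        obs, reward, terminated, truncated, _ = env.step(action)
        total_reward += reward
        if terminated or truncated:
            break
    return total_reward
```

In practice, the certification procedure would aggregate the total rewards from many such independently seeded rollouts to estimate the distribution of returns under smoothing.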