Function approximation has enabled remarkable advances in applying reinforcement learning (RL) in environments with high-dimensional inputs, such as images, in an end-to-end fashion, mapping such inputs directly to low-level control. Nevertheless, the resulting policies have proved vulnerable to small adversarial input perturbations. A number of approaches for improving or certifying the robustness of end-to-end RL to adversarial perturbations have emerged as a result, focusing on cumulative reward. However, what is often at stake in adversarial scenarios is the violation of fundamental properties, such as safety, rather than the overall reward, which combines safety with efficiency. Moreover, properties such as safety can only be defined with respect to the true state, rather than the high-dimensional raw inputs to end-to-end policies. To disentangle nominal efficiency from adversarial safety, we situate RL in deterministic partially-observable Markov decision processes (POMDPs) with the goal of maximizing cumulative reward subject to safety constraints. We then propose a partially-supervised reinforcement learning (PSRL) framework that takes advantage of an additional assumption: that the true state of the POMDP is known at training time. We present the first approach for certifying the safety of PSRL policies under adversarial input perturbations, along with two adversarial training approaches that make direct use of PSRL. Our experiments demonstrate both the efficacy of the proposed approach for certifying safety in adversarial environments, and the value of the PSRL framework coupled with adversarial training in improving certified safety while preserving high nominal reward and high-quality predictions of the true state.
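To make the setup concrete, one plausible way to write the objective the abstract describes is the following; the notation (discount factor \gamma, horizon T, unsafe state set \mathcal{S}_{\mathrm{unsafe}}) is illustrative and not taken from the paper:

\[
\max_{\pi} \;\; \sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t)
\quad \text{s.t.} \quad s_t \notin \mathcal{S}_{\mathrm{unsafe}} \;\; \forall t,
\]

where s_t is the true (hidden) state of the deterministic POMDP, the agent receives only a high-dimensional observation o_t of s_t (possibly adversarially perturbed), and actions are chosen as a_t = \pi(o_t). Note that the safety constraint is defined on s_t, not on o_t, which is precisely why it cannot be expressed in terms of the raw inputs alone.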
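The following minimal Python sketch illustrates the PSRL idea of using the true state as a training-time supervision signal: the policy is factored into a state predictor, supervised with true-state labels, and a controller that acts on the predicted state. All module names, layer sizes, the loss-weighting parameter lam, and the caller-supplied rl_loss_fn are assumptions made here for illustration, not the paper's exact design.

import torch
import torch.nn as nn

class PSRLPolicy(nn.Module):
    def __init__(self, obs_dim, state_dim, act_dim):
        super().__init__()
        # Perception: maps high-dimensional observations to a low-dimensional
        # estimate of the true POMDP state.
        self.predictor = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, state_dim))
        # Control: acts on the predicted true state rather than raw inputs.
        self.controller = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))

    def forward(self, obs):
        s_hat = self.predictor(obs)
        return self.controller(s_hat), s_hat

def psrl_loss(policy, obs, true_state, rl_loss_fn, lam=1.0):
    # Combine the usual RL objective with a supervised state-prediction loss;
    # true_state is available only at training time, per the PSRL assumption.
    logits, s_hat = policy(obs)
    rl_loss = rl_loss_fn(logits)  # e.g., a policy-gradient surrogate loss
    sup_loss = nn.functional.mse_loss(s_hat, true_state)
    return rl_loss + lam * sup_loss

One appeal of this factorization, as the abstract suggests, is that safety reasoning and certification can target the low-dimensional predicted state on which the controller acts, rather than the raw high-dimensional observations.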