We present a policy optimization framework in which the learned policy comes with a machine-checkable certificate of adversarial robustness. Our approach, called CAROL, learns a model of the environment. In each learning iteration, it uses the current version of this model and an external abstract interpreter to construct a differentiable signal for provable robustness. This signal is used to guide policy learning, and the abstract interpretation used to construct it directly leads to the robustness certificate returned at convergence. We give a theoretical analysis that bounds the worst-case cumulative reward of CAROL. We also experimentally evaluate CAROL on four MuJoCo environments. On these tasks, which involve continuous state and action spaces, CAROL learns certified policies whose performance is comparable to that of the (non-certified) policies learned using state-of-the-art robust RL methods.
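To make the training loop described above concrete, the following is a minimal, hypothetical Python sketch. It assumes a PyTorch-style setup and uses simple interval bound propagation over the policy network as a stand-in for the external abstract interpreter; the differentiable robustness signal it produces is added to an ordinary task loss. All names (PolicyNet, robustness_loss, eps, lam) and the specific losses are illustrative assumptions, not CAROL's actual implementation.

```python
# Hypothetical sketch: a single learning iteration that combines a
# standard policy objective with a differentiable robustness signal
# obtained from an abstract interpreter (here, interval bound propagation).
import torch
import torch.nn as nn


class PolicyNet(nn.Module):
    """Small deterministic policy for a continuous action space."""

    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, hidden)
        self.fc2 = nn.Linear(hidden, action_dim)

    def forward(self, s):
        return self.fc2(torch.relu(self.fc1(s)))

    def interval_forward(self, lo, hi):
        """Propagate an interval [lo, hi] of perturbed states through the
        network -- a very simple abstract interpreter."""
        for layer, apply_relu in ((self.fc1, True), (self.fc2, False)):
            w, b = layer.weight, layer.bias
            mid, rad = (lo + hi) / 2, (hi - lo) / 2
            mid = mid @ w.t() + b          # center of the output interval
            rad = rad @ w.abs().t()        # radius of the output interval
            lo, hi = mid - rad, mid + rad
            if apply_relu:                 # ReLU is monotone, so bounds are preserved
                lo, hi = torch.relu(lo), torch.relu(hi)
        return lo, hi


def robustness_loss(policy, states, eps):
    """Differentiable surrogate: penalize how much the action can vary
    under any l-infinity state perturbation of radius eps."""
    lo, hi = policy.interval_forward(states - eps, states + eps)
    return (hi - lo).mean()


# --- one illustrative training iteration ---------------------------------
state_dim, action_dim, eps, lam = 8, 2, 0.05, 0.1
policy = PolicyNet(state_dim, action_dim)
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

states = torch.randn(256, state_dim)           # stand-in for rollouts from the learned model
target_actions = torch.randn(256, action_dim)  # stand-in for a policy-improvement target

task_loss = ((policy(states) - target_actions) ** 2).mean()
cert_loss = robustness_loss(policy, states, eps)
loss = task_loss + lam * cert_loss             # robustness signal guides policy learning
opt.zero_grad()
loss.backward()
opt.step()
```

The same interval analysis that produces the training signal can be re-run on the converged policy to check the bounds it certifies, which is the sense in which the abstract-interpretation signal "directly leads to" the returned certificate.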