Advances in computing resources have led to increasingly complex cyber-physical systems (CPS). As CPS have grown more complex, the focus has shifted from traditional control methods to deep reinforcement learning (DRL)-based methods for controlling these systems, because accurate models of complex CPS, which traditional control requires, are difficult to obtain. However, to securely deploy DRL in production, it is essential to examine the weaknesses of DRL-based controllers (policies) to malicious attacks from all angles. In this work, we investigate targeted attacks in the action-space domain, commonly known as actuation attacks in the CPS literature, which perturb the outputs of a controller. We show that a query-based black-box attack model that generates optimal perturbations with respect to an adversarial goal can be formulated as another reinforcement learning problem; such an adversarial policy can therefore be trained using conventional DRL methods. Experimental results show that adversarial policies that observe only the nominal policy's output generate stronger attacks than adversarial policies that observe both the nominal policy's input and output. Further analysis reveals that nominal policies whose outputs frequently lie at the boundaries of the action space are naturally more robust to adversarial policies. Lastly, we propose the use of adversarial training with transfer learning to induce robust behaviors in the nominal policy, which decreases the rate of successful targeted attacks by 50%.
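The core formulation above can be illustrated with a minimal, dependency-free sketch. This is not the paper's actual setup: the environment, the nominal controller, and the perturbation family below (`ToyEnv`, `nominal_policy`, the constant-offset candidates) are all illustrative assumptions, and a grid search over constant perturbations stands in for training the adversarial policy with a DRL algorithm. It shows the essential structure: a black-box adversary that sees only the nominal policy's output, perturbs it within the action space, and is rewarded for steering the plant toward an attacker-chosen target.

```python
# Toy sketch (illustrative assumptions throughout): an action-space
# (actuation) attack framed as a search over adversarial perturbations.

class ToyEnv:
    """1-D toy plant: the state integrates the applied action."""
    def __init__(self):
        self.state = 0.0

    def step(self, action):
        self.state += 0.1 * action
        return self.state


def nominal_policy(state):
    """Hypothetical nominal controller: regulate the state to 0."""
    return max(-1.0, min(1.0, -state))


def run_episode(attack=None, steps=50):
    """Roll out the nominal policy, optionally perturbing its actions."""
    env = ToyEnv()
    for _ in range(steps):
        a = nominal_policy(env.state)
        if attack is not None:
            # Black-box action-space attack: the adversary observes only
            # the nominal policy's output `a` and perturbs it, with the
            # result still clipped to the action space [-1, 1].
            a = max(-1.0, min(1.0, a + attack(a)))
        env.step(a)
    return env.state


def adversarial_reward(final_state, target=1.0):
    """Targeted attack objective: drive the plant toward `target`."""
    return -abs(final_state - target)


# Degenerate stand-in for DRL training of the adversarial policy:
# evaluate a small family of constant perturbations and keep the one
# that maximizes the adversarial reward.
candidates = [-0.5, 0.0, 0.25, 0.5, 0.75]
best_delta = max(
    candidates,
    key=lambda d: adversarial_reward(run_episode(attack=lambda a: d)),
)

clean_state = run_episode()                                # unattacked rollout
attacked_state = run_episode(attack=lambda a: best_delta)  # attacked rollout
```

Under attack, the closed-loop fixed point shifts from the nominal setpoint toward the attacker's target, while the clean rollout stays at the origin; in the paper's setting, the constant-offset search is replaced by an adversarial policy trained with a conventional DRL method.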