Deep reinforcement learning (DRL) has achieved significant success in many real-world applications. However, due to occlusions and noisy sensors, these applications typically provide only partial observations for decision making. This partial state observability can, in turn, be exploited to hide the malicious behavior of backdoors. In this paper, we exploit the sequential nature of DRL and propose a novel temporal-pattern backdoor attack on DRL, whose trigger is a set of temporal constraints over a sequence of observations rather than a single observation, and whose effect persists for a controllable duration rather than acting instantaneously. We validate the proposed backdoor attack on a typical job scheduling task in cloud computing. Extensive experimental results show that our backdoor achieves excellent effectiveness, stealthiness, and sustainability. Its average clean data accuracy and attack success rate reach 97.8% and 97.5%, respectively.
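To make the trigger concept concrete, the following is a minimal, hypothetical sketch of a temporal-pattern trigger: it fires only when a sliding window of recent observations satisfies a set of ordered temporal constraints. The class name, window size, and thresholds are illustrative assumptions, not the paper's actual construction.

```python
from collections import deque


class TemporalTrigger:
    """Hypothetical temporal-pattern trigger: fires when the last `window`
    scalar observations jointly satisfy a set of temporal constraints.
    All constraint choices below are illustrative, not the paper's design."""

    def __init__(self, window=4, low=0.2, high=0.8):
        self.buf = deque(maxlen=window)  # sliding window of observations
        self.low = low
        self.high = high

    def observe(self, obs):
        """Record one observation; return True iff the trigger pattern holds."""
        self.buf.append(obs)
        if len(self.buf) < self.buf.maxlen:
            return False  # not enough history yet
        vals = list(self.buf)
        return (
            vals[0] < self.low                                # starts low
            and all(a < b for a, b in zip(vals, vals[1:]))    # strictly rising
            and vals[-1] > self.high                          # ends high
        )
```

Because the trigger is defined over a sequence rather than any single observation, no individual observation reveals it, which is the source of the stealthiness claimed above. For example, feeding the sequence `0.1, 0.3, 0.6, 0.9` fires the trigger only at the final step.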