To understand the security threats to reinforcement learning (RL) algorithms, this paper studies poisoning attacks that manipulate \emph{any} order-optimal learning algorithm towards a targeted policy in episodic RL, and examines the potential damage of two natural types of poisoning: the manipulation of \emph{rewards} and the manipulation of \emph{actions}. We discover that the effect of the attacks crucially depends on whether the rewards are bounded or unbounded. In the bounded-reward setting, we show that neither reward manipulation alone nor action manipulation alone can guarantee a successful attack. However, by combining reward and action manipulation, the adversary can manipulate any order-optimal learning algorithm to follow any targeted policy with $\tilde{\Theta}(\sqrt{T})$ total attack cost, which is order-optimal, without any knowledge of the underlying MDP. In contrast, in the unbounded-reward setting, we show that reward manipulation alone suffices for the adversary to manipulate any order-optimal learning algorithm to follow any targeted policy using only $\tilde{O}(\sqrt{T})$ contamination. Our results reveal useful insights into what can and cannot be achieved by poisoning attacks, and should spur further work on the design of robust RL algorithms.
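As a purely illustrative sketch of the reward-manipulation idea in the unbounded-reward setting (not the paper's actual construction), the snippet below shows an adversary that leaves on-target steps untouched and subtracts a fixed margin from the observed reward whenever the learner deviates from the target policy. The class name \texttt{TargetedRewardAttacker}, the \texttt{margin} parameter, and the \texttt{poison} interface are all hypothetical names introduced here for illustration.

\begin{verbatim}
# Illustrative sketch only: a fixed-margin reward-poisoning attacker
# for episodic RL with unbounded rewards. Not the paper's construction;
# all names and parameters here are hypothetical.

class TargetedRewardAttacker:
    """Perturbs observed rewards so a fixed target policy looks optimal."""

    def __init__(self, target_policy, margin=1.0):
        self.target_policy = target_policy  # maps (state, step) -> action
        self.margin = margin                # hypothetical attack strength
        self.total_cost = 0.0               # cumulative |poisoned - true|

    def poison(self, state, step, action, reward):
        # On-target steps need no contamination.
        if action == self.target_policy(state, step):
            return reward
        # Off-target steps: push the observed reward down by `margin`,
        # making deviations from the target policy look unattractive.
        self.total_cost += abs(self.margin)
        return reward - self.margin

# Toy usage: the target policy always plays action 0.
attacker = TargetedRewardAttacker(target_policy=lambda s, h: 0)
print(attacker.poison(state=3, step=0, action=1, reward=0.5))  # -0.5
print(attacker.total_cost)                                     # 1.0
\end{verbatim}

Under this kind of perturbation, an order-optimal learner deviates from the apparently optimal target policy on only $\tilde{O}(\sqrt{T})$ steps, so the cumulative contamination tracked in \texttt{total\_cost} is consistent with the $\tilde{O}(\sqrt{T})$ bound stated above; this is a heuristic reading of the result, not a proof.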