We study defense strategies against reward poisoning attacks in reinforcement learning. As a threat model, we consider attacks that minimally alter rewards to make the attacker's target policy uniquely optimal under the poisoned rewards, with the optimality gap specified by an attack parameter. Our goal is to design agents that are robust against such attacks in terms of their worst-case utility with respect to the true, unpoisoned rewards, while computing their policies under the poisoned rewards. We propose an optimization framework for deriving optimal defense policies, both when the attack parameter is known and when it is unknown. Moreover, we show that defense policies that solve the proposed optimization problems have provable performance guarantees. In particular, we provide the following bounds with respect to the true, unpoisoned rewards: a) lower bounds on the expected return of the defense policies, and b) upper bounds on how suboptimal these defense policies are compared to the attacker's target policy. We conclude the paper by illustrating the intuitions behind our formal results and showing that the derived bounds are non-trivial.
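To make the threat model concrete, the attack described above can be viewed as the following optimization problem; this is a minimal sketch under assumed notation (true reward $R$, poisoned reward $\widehat{R}$, target policy $\pi_\dagger$, policy score $\rho^{\pi}(\cdot)$, and attack parameter $\epsilon$), not the paper's exact formulation:
$$
\min_{\widehat{R}} \;\big\lVert \widehat{R} - R \big\rVert
\quad \text{s.t.} \quad
\rho^{\pi_\dagger}\big(\widehat{R}\big) \;\ge\; \rho^{\pi}\big(\widehat{R}\big) + \epsilon
\quad \forall\, \pi \ne \pi_\dagger .
$$
Under this sketch, the constraint enforces that $\pi_\dagger$ is uniquely optimal under the poisoned rewards with optimality gap at least $\epsilon$, while the objective keeps the alteration of the rewards minimal.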