We study a security threat to adversarial multi-armed bandits, in which an attacker perturbs the loss or reward signal to control the behavior of the victim bandit player. We show that the attacker is able to mislead any no-regret adversarial bandit algorithm into selecting a suboptimal target arm in every but sublinear (T-o(T)) number of rounds, while incurring only sublinear (o(T)) cumulative attack cost. This result implies critical security concern in real-world bandit-based systems, e.g., in online recommendation, an attacker might be able to hijack the recommender system and promote a desired product. Our proposed attack algorithms require knowledge of only the regret rate, thus are agnostic to the concrete bandit algorithm employed by the victim player. We also derived a theoretical lower bound on the cumulative attack cost that any victim-agnostic attack algorithm must incur. The lower bound matches the upper bound achieved by our attack, which shows that our attack is asymptotically optimal.
翻译:我们研究对对抗性多武装匪徒的安全威胁,攻击者在这种威胁中干扰丢失或奖励信号以控制受害者土匪行为。我们显示攻击者能够误导任何无雷敌对匪帮的算法,在每一枚亚线性子弹(T-o(T))数目中选择一个亚最佳目标臂,同时只产生亚线性攻击累积成本。这一结果意味着现实世界强盗系统的重大安全关切,例如在线建议中,攻击者可能能够劫持推荐者系统并推销理想产品。我们提出的攻击算法只要求了解遗憾率,因此对受害者玩家所使用的具体土匪算法是不可知的。我们还从任何受害者-敏感攻击算法都必须承担的累积攻击成本中得出了一个较低的理论约束。更低的界限与我们攻击所达到的上限相符,这表明我们的攻击是尽可能最佳的。