Saving Stochastic Bandits from Poisoning Attacks via Limited Data Verification

We study bandit algorithms under data poisoning attacks in a bounded reward setting. We consider a strong attacker model in which the attacker observes both the selected actions and their corresponding rewards and can contaminate the rewards with additive noise. We show that any bandit algorithm with regret $O(\log T)$ can be forced to suffer regret $\Omega(T)$ with an expected amount of contamination $O(\log T)$. This amount of contamination is also necessary: we prove that there exists an $O(\log T)$-regret bandit algorithm, specifically the classical UCB, that requires $\Omega(\log T)$ contamination to suffer regret $\Omega(T)$. To combat such attacks, our second main contribution is to propose verification-based mechanisms, which use verification to access a limited number of uncontaminated rewards. In particular, for the case of unlimited verifications, we show that with an $O(\log T)$ expected number of verifications, a simple modified version of an ETC (explore-then-commit) type bandit algorithm restores the order-optimal $O(\log T)$ regret irrespective of the amount of contamination used by the attacker. We also provide a UCB-like verification scheme, called Secure-UCB, that likewise enjoys full recovery from any attack with an $O(\log T)$ expected number of verifications. As a matching lower bound, we prove that for any order-optimal bandit algorithm, $\Omega(\log T)$ verifications are necessary to recover the order-optimal regret. On the other hand, when the number of verifications is bounded above by a budget $B$, we propose a novel algorithm, Secure-BARBAR, which provably achieves $O(\min\{C, T/\sqrt{B}\})$ regret with high probability against weak attackers, where $C$ is the total amount of contamination by the attacker. This breaks the known $\Omega(C)$ lower bound of the non-verified setting when $C$ is large.
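
To make the verification idea concrete, the following is a minimal sketch of an explore-then-commit learner that trusts only verified rewards. It is an illustration under stated assumptions, not the algorithm from the paper: the abstract does not give the exact procedure, so the function name `verified_etc`, the known lower bound `gap` on the suboptimality gap, the constant 4 in the exploration length, the Bernoulli reward model, and the `attack` callback are all hypothetical choices made for this example.

```python
import math
import random


def verified_etc(means, horizon, gap, attack, seed=0):
    """Explore-then-commit that estimates arm means from verified rewards only.

    means   : true Bernoulli means of the arms (hidden from a real learner)
    horizon : number of rounds T
    gap     : assumed lower bound on the smallest suboptimality gap (hypothetical input)
    attack  : function (arm, reward) -> contaminated reward shown to the learner
    """
    rng = random.Random(seed)
    k = len(means)
    # Exploration length grows like log(T), so verifications are O(log T) in total.
    pulls_per_arm = min(horizon // k, math.ceil(4 * math.log(horizon) / gap ** 2))
    sums = [0.0] * k
    verifications = 0
    regret = 0.0
    best_mean = max(means)
    rounds_used = 0

    # Exploration phase: every pull is verified, so contamination has no effect.
    for arm in range(k):
        for _ in range(pulls_per_arm):
            true_reward = 1.0 if rng.random() < means[arm] else 0.0
            _contaminated = attack(arm, true_reward)  # what the attacker shows; ignored
            sums[arm] += true_reward  # verification reveals the uncontaminated reward
            verifications += 1
            regret += best_mean - means[arm]
            rounds_used += 1

    # Commit phase: each arm has the same number of verified pulls, so comparing
    # sums is equivalent to comparing empirical means. Later (possibly
    # contaminated) rewards are never used to update the choice.
    committed = max(range(k), key=lambda a: sums[a])
    regret += (horizon - rounds_used) * (best_mean - means[committed])
    return committed, verifications, regret


def zero_out_best_arm(arm, reward):
    # Hypothetical attacker that hides every reward of arm 0.
    return 0.0 if arm == 0 else reward


arm, n_verified, regret = verified_etc(
    [0.9, 0.5], horizon=10_000, gap=0.4, attack=zero_out_best_arm
)
print(arm, n_verified, round(regret, 2))
```

Because the commit decision depends only on verified samples, the attacker's contamination budget never enters the estimate, which matches the abstract's claim that recovery holds irrespective of the amount of contamination. The price is that the sketch needs a gap lower bound up front, which a practical scheme would have to avoid.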
