We consider a stochastic multi-arm bandit problem where rewards are subject to adversarial corruption. We propose a novel attack strategy that manipulates a UCB principle into pulling some non-optimal target arm $T - o(T)$ times with a cumulative cost that scales as $\sqrt{\log T}$, where $T$ is the number of rounds. We also prove the first lower bound on the cumulative attack cost. Our lower bound matches our upper bound up to $\log \log T$ factors, showing our attack to be near optimal.
翻译:我们考虑一种随机的多臂赌博机问题,其中奖励遭到对抗性攻击。我们提出了一种新颖的攻击策略,将UCB算法引导到多次拉取某个非最优的目标臂,攻击累积成本的规模为$ \sqrt{\log T}$,其中$T$是回合数。我们还证明了攻击的累积成本的第一个下界,该下界与我们的上界相匹配,误差仅为$ \log \log T$。这显示出我们的攻击是近乎最优的。