In this work I study the problem of adversarial perturbations of rewards in the multi-armed bandit (MAB) setting. Specifically, I focus on an adversarial attack on a UCB-type best-arm identification policy applied to a stochastic MAB. The attack presented in [1] forces a UCB learner to pull a target arm $K$ very often. I use the attack model of [1] to derive the sample complexity required for the target arm $K$ to be selected as the best arm. I prove that the stopping condition of the UCB-based best-arm identification algorithm given in [2] can be met by the target arm $K$ within $T$ rounds, where $T$ depends only on the total number of arms and the parameter $\sigma$ of the $\sigma^2$-sub-Gaussian reward distributions of the arms.
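To make the quantities involved concrete, the following is a minimal sketch of the kind of attack constraint and stopping rule at play; the notation $\hat{\mu}_i(t)$, $N_i(t)$, $\beta_i(t)$, the attack margin $\epsilon_0$, and the confidence level $\delta_t$ are introduced here purely for illustration and need not match the exact definitions used in [1] and [2]. In the spirit of [1], after every pull of a non-target arm $i \neq K$ the attacker corrupts the observed reward so that the post-attack empirical mean satisfies
$$\hat{\mu}_i(t) \;\le\; \hat{\mu}_K(t) - \epsilon_0,$$
for some margin $\epsilon_0 > 0$. A generic UCB/LCB separation stopping rule of the kind used by UCB-based best-arm identification then recommends arm $K$ once
$$\hat{\mu}_K(t) - \beta_K(t) \;>\; \max_{i \neq K}\Big(\hat{\mu}_i(t) + \beta_i(t)\Big), \qquad \beta_i(t) = \sigma\sqrt{\frac{2\log(1/\delta_t)}{N_i(t)}}.$$
Under the forced gap $\epsilon_0$, this condition holds as soon as every confidence radius $\beta_i(t)$ falls below $\epsilon_0/2$, since then $\beta_K(t) + \beta_i(t) < \epsilon_0$ for all $i \neq K$; the number of rounds needed for the radii to shrink to that level is governed by the number of arms and $\sigma$ (for fixed $\epsilon_0$ and $\delta_t$).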