This paper investigates best arm identification in $\textit{contaminated}$ stochastic multi-armed bandits. In this setting, the reward obtained from any arm is replaced, with probability $\varepsilon$, by a sample from an adversarial model. A fixed-confidence (infinite-horizon) setting is considered, in which the learner's goal is to identify the arm with the largest mean. Owing to the adversarial contamination of the rewards, each arm's mean is only partially identifiable. This paper proposes two algorithms for best arm identification in sub-Gaussian bandits: a gap-based algorithm and one based on successive elimination. Both rely on mean estimates that asymptotically achieve the optimal error guarantee on the deviation of the estimate from the true mean. Both algorithms also asymptotically achieve the optimal sample complexity: the gap-based algorithm is optimal up to constant factors, while the successive-elimination-based algorithm is optimal up to logarithmic factors. Finally, numerical experiments illustrate the gains of the algorithms over existing baselines.
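To make the contamination model concrete, the following minimal sketch simulates one arm whose rewards are replaced with probability $\varepsilon$ and compares the plain sample mean with a simple trimmed mean. Everything here is an illustrative assumption rather than the paper's method: the function names `pull_contaminated` and `trimmed_mean`, the Gaussian reward model, the constant adversarial value, and the choice of trimming fraction are all hypothetical, and the trimmed mean only stands in for the robust mean estimates the abstract refers to.

```python
import random
import statistics


def pull_contaminated(true_mean, eps, adversary_value, rng, sigma=1.0):
    """Draw one reward from a contaminated arm (illustrative model).

    With probability eps the reward is replaced by adversary_value;
    otherwise it is a Gaussian draw around true_mean.
    """
    if rng.random() < eps:
        return adversary_value
    return rng.gauss(true_mean, sigma)


def trimmed_mean(samples, trim_frac):
    """Average the samples after dropping the trim_frac smallest and largest.

    Assumes 0 <= trim_frac < 0.5 so that some samples always remain.
    """
    xs = sorted(samples)
    k = int(trim_frac * len(xs))
    kept = xs[k:len(xs) - k] if k > 0 else xs
    return statistics.fmean(kept)


rng = random.Random(0)   # fixed seed so the run is reproducible
eps = 0.1                # contamination probability (assumed value)
true_mean = 1.0          # mean of the clean reward distribution (assumed)

samples = [pull_contaminated(true_mean, eps, 50.0, rng) for _ in range(5000)]

naive = statistics.fmean(samples)                  # pulled toward the adversary
robust = trimmed_mean(samples, trim_frac=2 * eps)  # discards the extreme draws

print(f"naive mean error:   {abs(naive - true_mean):.3f}")
print(f"trimmed mean error: {abs(robust - true_mean):.3f}")
```

In this sketch the trimmed estimate lands much closer to the clean mean than the plain average, but it does not converge to it exactly: some bias remains no matter how many samples are drawn, which is one way to see why each arm's mean is only partially identifiable under contamination.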