We consider the fixed-budget best arm identification problem in multi-armed bandits. A central goal in this field is to derive a tight lower bound on the probability of misidentifying the best arm and to develop a strategy whose performance guarantee matches that lower bound. However, this has long remained an open problem when the optimal allocation ratio of arm draws is unknown. In this paper, we answer this question in the regime where the gap between the expected rewards of the arms is small. First, we derive a tight problem-dependent lower bound, which characterizes the optimal allocation ratio; this ratio depends on the gap between the expected rewards and on the Fisher information of the bandit model. Then, we propose the "RS-AIPW" strategy, which combines a randomized sampling (RS) rule based on the estimated optimal allocation ratio with a recommendation rule based on the augmented inverse probability weighting (AIPW) estimator. The proposed strategy is optimal in the sense that its performance guarantee matches the derived lower bound under a small gap. In the course of the analysis, we also establish a novel large deviation bound for martingales.
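To make the recommendation rule concrete, the following is a minimal sketch of an AIPW mean estimate for a two-armed Gaussian bandit. All specifics (the arm means, the sampling ratio, the horizon, and the sample-splitting used for the plug-in estimates) are illustrative assumptions, not the paper's exact construction; in particular, the RS-AIPW strategy estimates the optimal allocation ratio and updates the plug-in estimates adaptively from past observations.

```python
import numpy as np

# Hypothetical two-armed Gaussian bandit with unit variance.
rng = np.random.default_rng(0)
T = 10_000
true_means = np.array([0.2, 0.8])   # assumed arm means (unknown to the strategy)
probs = np.array([0.4, 0.6])        # assumed sampling (allocation) ratio

arms = rng.choice(2, size=T, p=probs)          # randomized sampling rule
rewards = rng.normal(true_means[arms], 1.0)    # observed rewards

# Plug-in mean estimates from the first half of the data (the actual
# strategy updates these online so the AIPW terms form a martingale).
half = T // 2
m_hat = np.array([rewards[:half][arms[:half] == a].mean() for a in range(2)])

# AIPW estimate of each arm's mean over the second half:
#   (1/n) * sum_t [ 1{A_t = a} / p(a) * (Y_t - m_hat[a]) + m_hat[a] ]
a2, r2 = arms[half:], rewards[half:]
aipw = np.array([
    np.mean((a2 == a) / probs[a] * (r2 - m_hat[a]) + m_hat[a])
    for a in range(2)
])

# Recommendation rule: report the arm with the largest AIPW mean estimate.
best_arm = int(np.argmax(aipw))
```

The inverse-probability term corrects for the unequal draw frequencies induced by the sampling ratio, while the plug-in term reduces variance; this is the standard doubly robust construction that underlies the AIPW estimator.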