We lay the foundations of a non-parametric theory of best-arm identification in multi-armed bandits with a fixed budget T. We consider general, possibly non-parametric, models D for distributions over the arms; an overarching example is the model D = P(0,1) of all probability distributions over [0,1]. We propose upper bounds on the average log-probability of misidentifying the optimal arm, stated in terms of information-theoretic quantities given by infima of Kullback-Leibler divergences between some distributions in D and a given distribution. This is made possible by a refined analysis of the successive-rejects strategy of Audibert, Bubeck, and Munos (2010). Finally, we provide lower bounds on the same average log-probability, again in terms of these new information-theoretic quantities; the lower bounds are larger when the (natural) assumptions on the considered strategies are stronger. All these new upper and lower bounds generalize existing bounds based, for example, on gaps between distributions.
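For readers unfamiliar with the successive-rejects strategy referenced above, the following is a minimal sketch of the classical algorithm of Audibert, Bubeck, and Munos (2010), not of the refined analysis developed in this paper. The function name `successive_rejects` and the reward oracle `pull` are illustrative assumptions; the phase lengths follow the standard definition with the normalization constant log-bar(K) = 1/2 + sum_{i=2}^{K} 1/i.

```python
import math

def successive_rejects(pull, K, T):
    """Sketch of the successive-rejects strategy (Audibert, Bubeck & Munos, 2010).

    pull(i) is an oracle returning one reward in [0, 1] for arm i;
    K is the number of arms, T the fixed budget of arm pulls.
    Runs K-1 phases; at the end of each phase, the arm with the
    lowest empirical mean is discarded. Returns the surviving arm.
    """
    # Normalization constant log-bar(K) = 1/2 + sum_{i=2}^{K} 1/i.
    log_bar = 0.5 + sum(1.0 / i for i in range(2, K + 1))
    active = list(range(K))
    counts = [0] * K       # number of pulls of each arm so far
    sums = [0.0] * K       # cumulative reward of each arm so far
    n_prev = 0             # per-arm pull count reached in the previous phase
    for phase in range(1, K):  # phases k = 1, ..., K-1
        # Target per-arm pull count n_k for this phase.
        n_k = math.ceil((T - K) / (log_bar * (K + 1 - phase)))
        for i in active:
            for _ in range(n_k - n_prev):
                sums[i] += pull(i)
                counts[i] += 1
        n_prev = n_k
        # Reject the active arm with the lowest empirical mean.
        worst = min(active, key=lambda i: sums[i] / counts[i])
        active.remove(worst)
    return active[0]  # recommended arm
```

With deterministic rewards, e.g. `pull = lambda i: [0.9, 0.1, 0.2][i]`, `K = 3`, and `T = 30`, the sketch identifies arm 0. The paper's upper bounds control the probability of misidentification of this strategy when rewards are drawn from general distributions in D.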