In this paper we propose a general methodology for deriving regret bounds for randomized multi-armed bandit algorithms. It consists in checking a set of sufficient conditions on the sampling probability of each arm and on the family of reward distributions in order to prove logarithmic regret. As a direct application we revisit two well-known bandit algorithms, Minimum Empirical Divergence (MED) and Thompson Sampling (TS), under various models for the distributions, including single-parameter exponential families, Gaussian distributions, bounded distributions, and distributions satisfying certain moment conditions. In particular, we prove that MED is asymptotically optimal for all of these models, and we also provide a simple regret analysis of some TS algorithms whose optimality is already known. We then further illustrate the usefulness of our approach by analyzing a new Non-Parametric TS algorithm (h-NPTS), adapted to families of unbounded reward distributions with a bounded h-moment. This model can, for instance, capture non-parametric families of distributions whose variance is upper bounded by a known constant.
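For readers less familiar with this terminology, the following standard definitions (with notation introduced here for illustration, not taken from the paper) make "logarithmic regret" and "asymptotic optimality" precise in the K-armed setting. Writing $\mu_a$ for the mean of arm $a$, $\nu_a$ for its reward distribution, and $N_a(T)$ for the number of times arm $a$ has been pulled up to time $T$, the expected regret is
\[
R_T \;=\; \sum_{a=1}^{K} (\mu^\star - \mu_a)\,\mathbb{E}\!\left[N_a(T)\right],
\qquad \mu^\star \;=\; \max_{a} \mu_a .
\]
An algorithm has logarithmic regret when $R_T = O(\log T)$, and it is asymptotically optimal for a model $\mathcal{F}$ when it attains the Burnetas–Katehakis lower bound,
\[
\lim_{T \to \infty} \frac{R_T}{\log T}
\;=\; \sum_{a \,:\, \mu_a < \mu^\star} \frac{\mu^\star - \mu_a}{\mathcal{K}_{\inf}(\nu_a, \mu^\star)} ,
\]
where $\mathcal{K}_{\inf}(\nu_a, \mu^\star)$ denotes the smallest Kullback–Leibler divergence between $\nu_a$ and a distribution of $\mathcal{F}$ whose mean exceeds $\mu^\star$.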