The stochastic multi-armed bandit problem has been extensively studied under standard assumptions on the arms' distributions (e.g., bounded with known support, belonging to an exponential family, etc.). These assumptions are suitable for many real-world problems, but they sometimes require knowledge (of the tails, for instance) that may not be precisely accessible to the practitioner, raising the question of the robustness of bandit algorithms to model misspecification. In this paper we study a generic Dirichlet Sampling (DS) algorithm, based on pairwise comparisons of empirical indices computed by re-sampling the arms' observations together with a data-dependent exploration bonus. We show that different variants of this strategy achieve provably optimal regret guarantees when the distributions are bounded, and logarithmic regret for semi-bounded distributions satisfying a mild quantile condition. We also show that a simple tuning achieves robustness with respect to a large class of unbounded distributions, at the cost of a slightly worse than logarithmic asymptotic regret. Finally, we provide numerical experiments showing the merits of DS in a decision-making problem on synthetic agriculture data.
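To make the mechanism concrete, below is a minimal sketch of the duel-based selection rule described above, not the paper's exact pseudocode. It assumes every arm has already been pulled at least once, takes the most-pulled arm as the leader, and compares each challenger's Dirichlet-re-sampled mean (over its history augmented with an exploration bonus) against the leader's empirical mean. The helper names `dirichlet_index` and `ds_step` and the scalar `bonus` argument are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def dirichlet_index(history, bonus):
    # Append the exploration bonus to the arm's observed rewards, draw
    # uniform Dirichlet weights over the augmented history, and return
    # the re-sampled (weighted) mean used as the arm's duel index.
    augmented = np.append(history, bonus)
    weights = rng.dirichlet(np.ones(len(augmented)))
    return float(weights @ augmented)

def ds_step(histories, bonus):
    # One round of duels (assumes each arm's history is non-empty):
    # the leader is the most-pulled arm, ties broken by empirical mean;
    # each challenger whose re-sampled index beats the leader's empirical
    # mean gets pulled, and the leader is pulled if no challenger wins.
    pulls = [len(h) for h in histories]
    means = [float(np.mean(h)) for h in histories]
    leader = max(range(len(histories)), key=lambda k: (pulls[k], means[k]))
    winners = [k for k in range(len(histories)) if k != leader
               and dirichlet_index(histories[k], bonus) >= means[leader]]
    return winners if winners else [leader]
```

For rewards bounded in [0, b], a natural choice for `bonus` is the known upper bound b; as the abstract notes, the semi-bounded and unbounded variants replace it with data-dependent quantities.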