This work addresses a version of the two-armed Bernoulli bandit problem in which the means of the two arms sum to one (the symmetric two-armed Bernoulli bandit). In a regime where the gap between these means goes to zero and the number of prediction periods approaches infinity, we obtain the leading order terms of the expected regret and pseudoregret for this problem by associating each of them with a solution of a linear parabolic partial differential equation. These results improve upon previously known ones: we explicitly compute the leading order term of the optimal regret and pseudoregret in three different scaling regimes for the gap. Additionally, we obtain new non-asymptotic bounds for any given time horizon.
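To make the setting concrete, the following is a minimal simulation sketch of the symmetric two-armed Bernoulli bandit. It is an illustration only: the function name `simulate_pseudoregret`, the parameterization of the means as `(1 + gap) / 2` and `(1 - gap) / 2`, and the follow-the-leader rule used to pick arms are all assumptions introduced here, not the policy or the PDE-based analysis of this work. Pseudoregret is computed as the gap times the number of pulls of the arm with the smaller mean.

```python
import random


def simulate_pseudoregret(gap, horizon, seed=0):
    """Pseudoregret of a follow-the-leader heuristic on one sample path.

    Assumed setup (hypothetical, for illustration): arm 0 has mean
    (1 + gap) / 2 and arm 1 has mean (1 - gap) / 2, so the means sum to one.
    """
    rng = random.Random(seed)
    means = [(1 + gap) / 2, (1 - gap) / 2]
    # Because the means sum to one, a success on arm 0 and a failure on
    # arm 1 have the same probability, so one signed counter can summarize
    # the evidence in favor of arm 0.
    score = 0
    suboptimal_pulls = 0
    for _ in range(horizon):
        if score > 0:
            arm = 0
        elif score < 0:
            arm = 1
        else:
            arm = rng.randrange(2)  # break ties uniformly at random
        reward = 1 if rng.random() < means[arm] else 0
        # Success on arm 0 or failure on arm 1 counts as evidence for arm 0.
        if (arm == 0) == (reward == 1):
            score += 1
        else:
            score -= 1
        if arm == 1:
            suboptimal_pulls += 1
    # Each pull of arm 1 costs exactly `gap` in expected reward.
    return gap * suboptimal_pulls


if __name__ == "__main__":
    # Average over independent sample paths to estimate expected pseudoregret.
    runs = [simulate_pseudoregret(0.05, 2000, seed=s) for s in range(200)]
    print(sum(runs) / len(runs))
```

Averaging over many seeds, as in the usage lines at the bottom, gives a Monte Carlo estimate that could be compared against analytical bounds for a chosen gap and horizon. The estimate depends on the heuristic assumed here and says nothing by itself about the optimal policy.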