This paper studies regret minimization in multi-armed bandits, a classical online learning problem. To develop more statistically efficient algorithms, we propose to use the assumption of a random-effect model. In this model, the mean rewards of arms are drawn independently from an unknown distribution, whose parameters we estimate. We provide an estimator of the arm means in this model and also analyze its uncertainty. Based on these results, we design a UCB algorithm, which we call ReUCB. We analyze ReUCB and prove a Bayes regret bound on its $n$-round regret, which matches an existing lower bound. Our experiments show that ReUCB can outperform Thompson sampling in various scenarios, without assuming that the prior distribution of arm means is known.
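As a concrete illustration, consider a Gaussian instance of such a random-effect model; the specific distributional choices and the resulting formulas below are an illustrative sketch under these assumptions, not necessarily the exact estimator or index analyzed in the paper. Suppose the arm means and rewards satisfy
$$\mu_i \sim \mathcal{N}(\mu_0, \sigma_0^2), \qquad Y_{i,t} \mid \mu_i \sim \mathcal{N}(\mu_i, \sigma^2),$$
where $(\mu_0, \sigma_0^2)$ are the unknown parameters of the random-effect distribution. If arm $i$ has been pulled $n_i$ times with sample mean $\bar{Y}_i$, the conditional mean estimate of $\mu_i$ and its variance are the standard shrinkage quantities
$$\hat{\mu}_i = \frac{\sigma_0^2}{\sigma_0^2 + \sigma^2 / n_i}\, \bar{Y}_i + \frac{\sigma^2 / n_i}{\sigma_0^2 + \sigma^2 / n_i}\, \mu_0, \qquad \hat{s}_i^2 = \frac{\sigma_0^2 \, \sigma^2 / n_i}{\sigma_0^2 + \sigma^2 / n_i},$$
with $(\mu_0, \sigma_0^2)$ replaced by estimates computed from the observed arm means when they are unknown. A UCB-style index in this setting takes the form $\hat{\mu}_i + c\, \hat{s}_i$ for a suitable confidence width $c$, which shrinks estimates of infrequently pulled arms toward the estimated population mean rather than relying on their noisy sample means alone.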