Efficient exploration in multi-armed bandits is a fundamental online learning problem. In this work, we propose a variant of Thompson sampling that learns to explore better as it interacts with problem instances drawn from an unknown prior distribution. Our algorithm meta-learns the prior, and thus we call it Meta-TS. We propose efficient implementations of Meta-TS and analyze it in Gaussian bandits. Our analysis shows the benefit of meta-learning the prior and is of broader interest, because we derive the first prior-dependent upper bound on the Bayes regret of Thompson sampling. This result is complemented by an empirical evaluation, which shows that Meta-TS quickly adapts to the unknown prior.
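To make the setting concrete, the sketch below runs Thompson sampling on a sequence of Gaussian bandit instances whose mean vectors are drawn around a shared, unknown prior mean, and updates a running estimate of that prior mean across tasks. It assumes Gaussian rewards with known noise variance; the plain averaging meta-update and names such as `thompson_sampling` and `meta_mu` are illustrative simplifications, not the paper's exact Meta-TS procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def thompson_sampling(mu_prior, sigma_prior, true_means, noise_sd, horizon):
    """Gaussian Thompson sampling for one bandit instance.

    mu_prior, sigma_prior: per-arm prior mean and std (the meta-learned part).
    Returns per-arm reward sums and pull counts for the meta-level update.
    """
    K = len(true_means)
    post_mu = np.array(mu_prior, dtype=float)
    post_var = np.array(sigma_prior, dtype=float) ** 2
    sums = np.zeros(K)
    counts = np.zeros(K)
    for _ in range(horizon):
        # Sample a mean vector from the posterior and act greedily on the sample.
        arm = int(np.argmax(rng.normal(post_mu, np.sqrt(post_var))))
        reward = rng.normal(true_means[arm], noise_sd)
        sums[arm] += reward
        counts[arm] += 1
        # Conjugate Gaussian posterior update for the pulled arm.
        prec = 1.0 / sigma_prior[arm] ** 2 + counts[arm] / noise_sd ** 2
        post_var[arm] = 1.0 / prec
        post_mu[arm] = post_var[arm] * (mu_prior[arm] / sigma_prior[arm] ** 2
                                        + sums[arm] / noise_sd ** 2)
    return sums, counts

# Meta-level loop: tasks share an unknown prior mean; maintain a running estimate.
K, noise_sd, horizon, n_tasks = 5, 1.0, 200, 50
true_prior_mean = rng.normal(0.0, 1.0, K)   # unknown to the learner
meta_mu = np.zeros(K)                       # current estimate of the prior mean
for task in range(1, n_tasks + 1):
    task_means = true_prior_mean + rng.normal(0.0, 0.5, K)  # instance ~ prior
    sums, counts = thompson_sampling(meta_mu, np.ones(K), task_means,
                                     noise_sd, horizon)
    # Crude meta-update: fold the per-task empirical arm means into the prior mean.
    task_est = np.where(counts > 0, sums / np.maximum(counts, 1), meta_mu)
    meta_mu += (task_est - meta_mu) / task

print("estimated prior mean:", np.round(meta_mu, 2))
print("true prior mean:     ", np.round(true_prior_mean, 2))
```

Under these assumptions, the per-task Thompson sampler benefits from an increasingly accurate prior mean, which is the effect the abstract describes; the paper's analysis quantifies this benefit through a prior-dependent Bayes regret bound.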