Efficient exploration in bandits is a fundamental online learning problem. We propose a variant of Thompson sampling that learns to explore better as it interacts with bandit instances drawn from an unknown prior. The algorithm meta-learns the prior, and thus we call it MetaTS. We propose several efficient implementations of MetaTS and analyze it in Gaussian bandits. Our analysis shows the benefit of meta-learning and is of broader interest, because we derive a novel prior-dependent Bayes regret bound for Thompson sampling. Our theory is complemented by an empirical evaluation, which shows that MetaTS quickly adapts to the unknown prior.
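To make the setting concrete, the sketch below is a minimal, hypothetical illustration (not the authors' exact MetaTS algorithm) of meta-learned Thompson sampling in K-armed Gaussian bandits: arm means in each task are assumed drawn i.i.d. from N(mu_star, tau^2) with an unknown prior mean mu_star, and the learner maintains a Gaussian meta-posterior over mu_star that it samples from at the start of each task and updates between tasks. All variable names and the meta-update rule are illustrative assumptions.

    # Illustrative sketch of meta-learned Thompson sampling in Gaussian bandits.
    # Assumptions: known reward noise sigma and prior width tau; unknown prior mean mu_star.
    import numpy as np

    rng = np.random.default_rng(0)
    K, n_tasks, horizon = 5, 50, 200
    sigma, tau = 1.0, 1.0          # reward noise and prior width (assumed known)
    mu_star = 0.5                  # unknown prior mean to be meta-learned

    # Meta-posterior over mu_star: N(meta_mean, meta_var)
    meta_mean, meta_var = 0.0, 10.0

    for task in range(n_tasks):
        true_means = rng.normal(mu_star, tau, size=K)        # a new bandit instance
        # Sample a prior mean from the meta-posterior, then run Thompson sampling
        prior_mean = rng.normal(meta_mean, np.sqrt(meta_var))
        post_mean = np.full(K, prior_mean)
        post_var = np.full(K, tau ** 2)
        for t in range(horizon):
            # Thompson sampling: sample a mean per arm, pull the argmax
            arm = int(np.argmax(rng.normal(post_mean, np.sqrt(post_var))))
            reward = rng.normal(true_means[arm], sigma)
            # Conjugate Gaussian posterior update for the pulled arm
            precision = 1.0 / post_var[arm] + 1.0 / sigma ** 2
            post_mean[arm] = (post_mean[arm] / post_var[arm] + reward / sigma ** 2) / precision
            post_var[arm] = 1.0 / precision
        # Meta-update (a crude approximation): treat the average posterior arm mean
        # as a noisy observation of mu_star and update the meta-posterior conjugately
        obs = post_mean.mean()
        obs_var = (tau ** 2 + post_var.mean()) / K
        precision = 1.0 / meta_var + 1.0 / obs_var
        meta_mean = (meta_mean / meta_var + obs / obs_var) / precision
        meta_var = 1.0 / precision

    print(f"meta-learned prior mean ~ {meta_mean:.2f} (true {mu_star})")

Across tasks, the meta-posterior concentrates around the true prior mean, so later tasks start from a better-matched prior, which is the mechanism by which meta-learning can reduce Bayes regret in this setting.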