A contextual bandit is a popular framework for online learning to act under uncertainty. In practice, the number of actions is often huge and their expected rewards are correlated. In this work, we introduce a general framework for capturing such correlations through a mixed-effect model, in which actions are related through multiple shared effect parameters. To explore efficiently using this structure, we propose Mixed-Effect Thompson Sampling (meTS) and bound its Bayes regret. The regret bound has two terms: one for learning the action parameters and one for learning the shared effect parameters. Both terms reflect the structure of our model and the quality of the priors. Our theoretical findings are validated empirically on both synthetic and real-world problems. We also propose numerous extensions of practical interest. While these do not come with guarantees, they perform well empirically and demonstrate the generality of the proposed framework.
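To make the idea concrete, below is a minimal sketch of Thompson sampling under a mixed-effect reward model. It is not the paper's meTS algorithm, only an illustrative toy: the mean reward of each action is assumed to decompose as a known linear mixture of a few shared effect parameters plus a small per-action offset, everything is Gaussian, and exact conjugate posterior updates are used. The mixing matrix `A`, the prior scales, and all problem sizes are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

K, L, T = 20, 3, 2000      # actions, shared effects, rounds (illustrative)
sigma = 0.5                # reward noise std (assumed known)

# Mixing matrix: how each action loads on the shared effects (assumed known).
A = rng.normal(size=(K, L))

# Ground truth: shared effects mu and small per-action offsets b.
mu_true = rng.normal(size=L)
b_true = 0.1 * rng.normal(size=K)
theta_true = A @ mu_true + b_true    # true mean reward of each action

# Stack the joint parameter w = (mu, b); action a has feature x_a = [A_a, e_a],
# so the reward model is r = x_a^T w + noise, a Bayesian linear regression.
d = L + K
X = np.hstack([A, np.eye(K)])

# Gaussian prior: wide on the shared effects, tight on the offsets.
Lam = np.diag([1.0] * L + [1.0 / 0.1**2] * K)   # posterior precision
eta = np.zeros(d)                               # precision-weighted mean

regret = 0.0
for t in range(T):
    # Thompson sampling: draw w from the current Gaussian posterior,
    # then act greedily with respect to the sample.
    cov = np.linalg.inv(Lam)
    w = rng.multivariate_normal(cov @ eta, cov)
    a = int(np.argmax(X @ w))
    r = theta_true[a] + sigma * rng.normal()
    # Conjugate posterior update for the pulled action.
    Lam += np.outer(X[a], X[a]) / sigma**2
    eta += X[a] * r / sigma**2
    regret += theta_true.max() - theta_true[a]

print(f"cumulative regret after {T} rounds: {regret:.1f}")
```

Because the shared effects are common to all actions, feedback from one action tightens the posterior over `mu` and hence over every correlated action, which is the source of the savings that the paper's two-term regret bound formalizes.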