A contextual bandit is a popular framework for online learning to act under uncertainty. In many practical problems, the number of actions is huge and their expected rewards are correlated. In this work, we introduce a general framework for capturing such correlations through a mixed-effect model, in which actions are related through multiple shared effect parameters. We propose Mixed-Effect Thompson Sampling (meTS), which uses this structure to explore efficiently, and we bound its Bayes regret. The regret bound has two terms: one for learning the action parameters and one for learning the shared effect parameters. The terms reflect the structure of our model and the quality of the priors. Our theoretical findings are validated empirically on both synthetic and real-world problems. We also propose several extensions of practical interest. While these do not come with guarantees, they perform very well empirically and demonstrate the generality of the proposed framework.
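To make the setup concrete, here is a minimal sketch of Thompson Sampling under one possible mixed-effect model: a Gaussian bandit in which each action's mean reward is a known linear mixture of shared effect parameters plus an action-specific offset. The model, the variable names (`B`, `psi`, `eps`), and the exact posterior updates are illustrative assumptions for this sketch, not the paper's implementation.

```python
# A minimal sketch of mixed-effect Thompson Sampling (assumed Gaussian model,
# not the paper's exact algorithm). Each action a has mean reward
#   theta_a = b_a^T psi + eps_a,
# where psi are shared effect parameters, b_a is a known mixing vector, and
# eps_a is an action-specific offset. Stacking phi = (psi, eps_1..eps_K)
# makes rewards linear in phi, so the posterior stays Gaussian and exact.
import numpy as np

rng = np.random.default_rng(0)
K, L = 20, 3                       # number of actions, number of shared effects
sigma = 0.5                        # reward noise std
sigma_psi, sigma_eps = 1.0, 0.3    # prior stds of shared / action effects

B = rng.normal(size=(K, L))        # known mixing vectors b_a (rows)
# Feature of action a under the stacked parameter phi = (psi, eps_1..eps_K).
X = np.hstack([B, np.eye(K)])      # shape (K, L + K)

# Ground-truth parameters drawn from the prior (synthetic environment).
psi_true = rng.normal(scale=sigma_psi, size=L)
eps_true = rng.normal(scale=sigma_eps, size=K)
theta_true = B @ psi_true + eps_true

# Gaussian posterior over phi, stored as precision Lambda and vector b,
# so that the posterior mean is Lambda^{-1} b.
d = L + K
Lambda = np.diag(np.r_[np.full(L, sigma_psi**-2), np.full(K, sigma_eps**-2)])
b = np.zeros(d)

T, regret = 2000, 0.0
for t in range(T):
    # Thompson sampling: draw phi from the posterior, act greedily on it.
    cov = np.linalg.inv(Lambda)
    phi = rng.multivariate_normal(cov @ b, cov)
    a = int(np.argmax(X @ phi))
    r = theta_true[a] + rng.normal(scale=sigma)
    # Exact Bayesian linear-regression posterior update.
    Lambda += np.outer(X[a], X[a]) / sigma**2
    b += X[a] * r / sigma**2
    regret += theta_true.max() - theta_true[a]

print(f"cumulative regret after {T} rounds: {regret:.1f}")
```

In this sketch, stacking the shared and action-specific parameters keeps the posterior exactly Gaussian, so uncertainty about the shared effects and about the per-action offsets is reduced jointly, which loosely mirrors the two-term structure of the regret bound described above.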