A contextual bandit is a popular and practical framework for online learning to act under uncertainty. In many problems, the number of actions is huge and their mean rewards are correlated. In this work, we introduce a general framework that captures such correlations through a two-level graphical model in which actions are related through multiple shared latent parameters. We propose G-HierTS, a Thompson sampling algorithm that uses this structure to explore efficiently, and bound its Bayes regret. The regret bound has two terms, one for learning the action parameters and the other for learning the shared latent parameters. The terms reflect the structure of our model as well as the quality of the priors. Our theoretical findings are validated empirically on both synthetic and real-world problems. We also experiment with a variant of G-HierTS that maintains a factored posterior over the latent parameters. While this approximation does not come with guarantees, it improves computational efficiency with a minimal impact on empirical regret.
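As an illustrative sketch (not part of the abstract, and with notation chosen here for exposition rather than taken from the paper), one concrete instance of such a two-level graphical model is a Gaussian hierarchy: latent parameters $\mu_j \sim \mathcal{N}(\nu_j, \Sigma_0)$ for $j = 1, \dots, L$, action parameters $\theta_a \mid \mu \sim \mathcal{N}\big(\sum_j w_{a,j}\, \mu_j, \Sigma_a\big)$, where the weights $w_{a,j}$ encode which latent parameters each action shares, and observed rewards $r_t \mid \theta_{A_t} \sim \mathcal{N}(x_t^\top \theta_{A_t}, \sigma^2)$ for context $x_t$ and chosen action $A_t$. Under this kind of model, a hierarchical Thompson sampling algorithm samples the latent and action parameters from their joint posterior at each round and acts greedily with respect to the sampled action parameters; a factored approximation would instead maintain an independent posterior per latent parameter to reduce computation.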