Mean rewards of actions are often correlated. The form of these correlations may be complex and unknown a priori, such as the preferences of a user for recommended products and their categories. To maximize statistical efficiency, it is important to leverage these correlations when learning. We formulate a bandit variant of this problem where the correlations of mean action rewards are represented by a hierarchical Bayesian model with latent variables. Since the hierarchy can have multiple layers, we call it deep. We propose a hierarchical Thompson sampling algorithm (HierTS) for this problem, and show how to implement it efficiently for Gaussian hierarchies. The efficient implementation is possible due to a novel exact hierarchical representation of the posterior, which itself is of independent interest. We use this exact posterior to analyze the Bayes regret of HierTS in Gaussian bandits. Our analysis reflects the structure of the problem, that the regret decreases with the prior width, and also shows that hierarchies reduce the regret by non-constant factors in the number of actions. We confirm these theoretical findings empirically, in both synthetic and real-world experiments.
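To make the setting concrete, here is a minimal illustrative sketch of Thompson sampling in a two-level Gaussian hierarchy: a scalar latent parameter is drawn from a hyper-prior, each action's mean reward is drawn around that latent parameter, and rewards are Gaussian. All variable names, variances, and the single-latent-variable model below are assumptions made for exposition; they are not the paper's exact HierTS algorithm, notation, or posterior derivation, which handles general (deep) Gaussian hierarchies.

```python
# Sketch only: two-level Gaussian hierarchy with one scalar latent parameter mu.
#   mu      ~ N(mu_0, sigma_0^2)               (latent hyper-parameter)
#   theta_a ~ N(mu, sigma_1^2) for each arm a  (mean reward of action a)
#   reward  ~ N(theta_a, sigma^2)
# Each round: sample mu from its exact marginal posterior, sample each theta_a
# conditioned on mu and that arm's data, then pull the argmax.
import numpy as np

rng = np.random.default_rng(0)
K = 10                                  # number of actions
mu_0, sigma_0 = 0.0, 1.0                # hyper-prior on mu (assumed values)
sigma_1, sigma = 0.5, 1.0               # action-level and reward noise scales

# Ground truth, used only to simulate rewards.
true_mu = rng.normal(mu_0, sigma_0)
true_theta = rng.normal(true_mu, sigma_1, size=K)

n = np.zeros(K)                         # pull counts per action
s = np.zeros(K)                         # reward sums per action

for t in range(1000):
    # Exact posterior of mu: each pulled arm contributes a Gaussian factor with
    # variance sigma_1^2 + sigma^2 / n_a, obtained by marginalizing out theta_a.
    pulled = n > 0
    obs_var = sigma_1**2 + sigma**2 / np.maximum(n, 1)
    prec_mu = 1.0 / sigma_0**2 + np.sum(1.0 / obs_var[pulled])
    mean_mu = (mu_0 / sigma_0**2
               + np.sum((s[pulled] / n[pulled]) / obs_var[pulled])) / prec_mu
    mu_sample = rng.normal(mean_mu, np.sqrt(1.0 / prec_mu))

    # Given the sampled mu, each theta_a has a conjugate Gaussian posterior.
    prec_theta = 1.0 / sigma_1**2 + n / sigma**2
    mean_theta = (mu_sample / sigma_1**2 + s / sigma**2) / prec_theta
    theta_sample = rng.normal(mean_theta, np.sqrt(1.0 / prec_theta))

    # Act greedily with respect to the sampled means, then update statistics.
    a = int(np.argmax(theta_sample))
    r = rng.normal(true_theta[a], sigma)
    n[a] += 1
    s[a] += r
```

In this toy version, observations from any action sharpen the posterior of the shared latent parameter, which in turn tightens the posteriors of all other actions; this is the information-sharing effect that, per the abstract, reduces regret by non-constant factors in the number of actions.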