We study Thompson sampling for linear bandit problems with finitely many independent arms, where each reward is drawn from a normal distribution whose mean depends linearly on an unknown parameter vector and whose variance is also unknown. In a Bayesian formulation, we place multivariate normal-gamma priors on all unknown parameters to represent uncertainty about the environment. We show that this prior is conjugate to the reward model, and we derive a Bayesian regret bound for Thompson sampling under the condition that the 5/2-moment of the variance distribution exists.
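The setting described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm or notation; it is one standard way to run Thompson sampling with a conjugate normal-gamma posterior per arm. Everything named here is an assumption made for illustration: the class `NormalGammaArm`, the function `run`, a shared context vector `x` per round, the prior values `alpha0` and `beta0`, and the simulated environment. The precision `tau = 1 / sigma^2` is given a Gamma(alpha, rate beta) prior, so the variance is inverse-gamma and its 5/2-moment is finite exactly when `alpha > 5/2`; the default `alpha0 = 3.0` is chosen to satisfy that under this parameterization.

```python
import numpy as np


class NormalGammaArm:
    """Hypothetical per-arm model: r ~ N(x @ theta, 1 / tau), with
    tau ~ Gamma(alpha, rate=beta) and theta | tau ~ N(mu, inv(tau * Lam))."""

    def __init__(self, dim, alpha0=3.0, beta0=1.0):
        # alpha0 > 5/2 keeps the 5/2-moment of the variance finite (assumed prior values).
        self.mu = np.zeros(dim)
        self.Lam = np.eye(dim)
        self.alpha = alpha0
        self.beta = beta0

    def sample(self, rng):
        """Draw (theta, sigma^2) from the current normal-gamma posterior."""
        tau = rng.gamma(shape=self.alpha, scale=1.0 / self.beta)
        # With Lam = L @ L.T, the vector solve(L.T, z) has covariance inv(Lam).
        L = np.linalg.cholesky(self.Lam)
        z = rng.standard_normal(self.mu.shape[0])
        theta = self.mu + np.linalg.solve(L.T, z) / np.sqrt(tau)
        return theta, 1.0 / tau

    def update(self, x, r):
        """Conjugate update after observing reward r for feature vector x."""
        resid = r - x @ self.mu
        # Predictive-residual form of the beta update; the increment is never negative.
        scale = 1.0 + x @ np.linalg.solve(self.Lam, x)
        Lam_new = self.Lam + np.outer(x, x)
        self.mu = np.linalg.solve(Lam_new, self.Lam @ self.mu + x * r)
        self.Lam = Lam_new
        self.alpha += 0.5
        self.beta += 0.5 * resid**2 / scale


def run(n_rounds=2000, n_arms=3, dim=2, seed=0):
    """Simulate Thompson sampling on a synthetic environment (illustrative only)."""
    rng = np.random.default_rng(seed)
    true_theta = rng.normal(size=(n_arms, dim))
    true_sigma = rng.uniform(0.5, 1.5, size=n_arms)
    arms = [NormalGammaArm(dim) for _ in range(n_arms)]
    regret = 0.0
    for _ in range(n_rounds):
        x = rng.normal(size=dim)           # assumed shared context for the round
        means = true_theta @ x             # true expected reward of each arm
        sampled = [arm.sample(rng)[0] @ x for arm in arms]
        a = int(np.argmax(sampled))        # act greedily on the posterior sample
        r = means[a] + true_sigma[a] * rng.normal()
        arms[a].update(x, r)               # only the played arm learns
        regret += means.max() - means[a]
    return arms, regret


if __name__ == "__main__":
    _, total_regret = run()
    print("cumulative regret:", total_regret)
```

Because the arms are independent, each arm keeps its own posterior and only the played arm is updated. Sampling the variance together with the parameter vector is what distinguishes this from the known-variance Gaussian case.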