We study the efficiency of Thompson sampling for contextual bandits. Existing Thompson sampling-based algorithms need to construct a Laplace approximation (i.e., a Gaussian distribution) of the posterior distribution, from which sampling is inefficient in high-dimensional applications with general covariance matrices. Moreover, the Gaussian approximation may not be a good surrogate for the posterior distribution for general reward-generating functions. We propose an efficient posterior sampling algorithm, viz., Langevin Monte Carlo Thompson Sampling (LMC-TS), that uses Markov chain Monte Carlo (MCMC) methods to directly sample from the posterior distribution in contextual bandits. Our method is computationally efficient since it only needs to perform noisy gradient descent updates, without constructing the Laplace approximation of the posterior distribution. We prove that the proposed algorithm achieves the same sublinear regret bound as the best Thompson sampling algorithms for a special case of contextual bandits, viz., linear contextual bandits. We conduct experiments on both synthetic data and real-world datasets with different contextual bandit models, which demonstrate that directly sampling from the posterior is both computationally efficient and competitive in performance.
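To illustrate the idea of replacing Laplace-approximation sampling with noisy gradient updates, the following is a minimal sketch of one round of Langevin-style Thompson sampling for a linear contextual bandit, assuming a Gaussian likelihood and a Gaussian (ridge) prior. The function name `lmc_ts_round` and the hyperparameters `eta`, `beta`, `K`, and `lam` are illustrative choices for this sketch, not the paper's notation or recommended values.

```python
# Sketch of one round of Langevin Monte Carlo posterior sampling for a
# linear contextual bandit (assumed linear reward model with Gaussian noise).
import numpy as np

def lmc_ts_round(theta, X_hist, r_hist, contexts,
                 eta=1e-3, beta=1.0, K=50, lam=1.0, rng=None):
    """Run K noisy gradient (Langevin) steps on the negative log posterior,
    then act greedily with the resulting parameter sample.

    theta    : current parameter sample, shape (d,)
    X_hist   : features of previously chosen arms, shape (t, d)
    r_hist   : previously observed rewards, shape (t,)
    contexts : candidate arm features for this round, shape (n_arms, d)
    """
    rng = rng or np.random.default_rng()
    d = theta.shape[0]
    for _ in range(K):
        # Gradient of the negative log posterior: here a ridge-regression
        # loss, but any differentiable reward model could be plugged in.
        residual = X_hist @ theta - r_hist            # shape (t,)
        grad = X_hist.T @ residual + lam * theta      # shape (d,)
        # Langevin update: a gradient descent step plus scaled Gaussian
        # noise, so theta becomes an (approximate) posterior sample.
        noise = rng.standard_normal(d)
        theta = theta - eta * grad + np.sqrt(2.0 * eta / beta) * noise
    # Thompson sampling step: play the arm that looks best under the sample.
    arm = int(np.argmax(contexts @ theta))
    return theta, arm
```

Because each round only performs gradient-plus-noise updates, no covariance matrix has to be formed or factorized, which is the source of the computational advantage over Laplace-approximation-based sampling described above.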