In this note, we introduce a general version of the well-known elliptical potential lemma, a widely used technique in the analysis of algorithms for sequential learning and decision-making problems. We consider a stochastic linear bandit setting in which a decision-maker sequentially chooses among a set of given actions, observes their noisy rewards, and aims to maximize her cumulative expected reward over a decision-making horizon. The elliptical potential lemma is a key tool for quantifying uncertainty in estimating the parameters of the reward function, but it requires the noise and the prior distributions to be Gaussian. Our general elliptical potential lemma relaxes this Gaussian requirement, which is a highly non-trivial extension for several reasons: unlike the Gaussian case, there is no closed-form expression for the covariance matrix of the posterior distribution, the covariance matrix is not a deterministic function of the actions, and the covariance matrix is not decreasing with respect to the semidefinite (Loewner) order. While this result is of broad interest, we showcase an application of it by proving an improved Bayesian regret bound for the well-known Thompson sampling algorithm in stochastic linear bandits with changing action sets, where the prior and noise distributions are general. This bound is minimax optimal up to constants.
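For context, a minimal sketch of the standard elliptical potential lemma that this note generalizes, stated in the usual Gaussian/ridge-regression setting; the notation below is illustrative and not taken from the paper. For actions $a_1,\dots,a_T \in \mathbb{R}^d$ with $\|a_t\|_2 \le 1$ and design matrices $V_0 = \lambda I$, $V_t = V_{t-1} + a_t a_t^{\top}$, one has
$$\sum_{t=1}^{T} \min\bigl\{1,\ \|a_t\|_{V_{t-1}^{-1}}^{2}\bigr\} \;\le\; 2\log\frac{\det V_T}{\det V_0} \;\le\; 2d\log\Bigl(1 + \frac{T}{\lambda d}\Bigr).$$
In the Gaussian case, $V_t^{-1}$ (up to scaling) is exactly the posterior covariance, which is deterministic and decreasing in the Loewner order; the general lemma must control the analogous potential without any of these properties.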