This paper studies the stochastic linear bandit problem, in which a decision-maker chooses actions from possibly time-dependent sets of vectors in $\mathbb{R}^d$ and receives noisy rewards. The objective is to minimize regret, the difference over a sequence of $T$ decisions between the cumulative expected reward of the decision-maker and that of an oracle with access to the expected reward of each action. Linear Thompson Sampling (LinTS) is a popular Bayesian heuristic, supported by theoretical analysis showing that its Bayesian regret is bounded by $\widetilde{\mathcal{O}}(d\sqrt{T})$, matching minimax lower bounds. However, previous studies show that the best known frequentist regret bound for LinTS is $\widetilde{\mathcal{O}}(d\sqrt{dT})$, which requires posterior variance inflation and is a factor of $\sqrt{d}$ worse than the bounds for the best optimism-based algorithms. We prove that this inflation is fundamental and that the frequentist bound of $\widetilde{\mathcal{O}}(d\sqrt{dT})$ is the best possible, by demonstrating a randomization bias phenomenon in LinTS that can cause linear regret without inflation. We then propose a data-driven version of LinTS that adjusts the posterior inflation using the observed data and, under additional conditions, achieves minimax-optimal frequentist regret. Our analysis provides new insights into LinTS and settles an open problem in the field.
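For concreteness, the following is a minimal sketch of LinTS with an explicit posterior-inflation factor, assuming a standard Gaussian posterior over a ridge estimate; the function names, parameters, and the toy instance are illustrative assumptions, not the paper's exact algorithm or notation.

```python
# Minimal LinTS sketch with a posterior-inflation factor (illustrative only).
import numpy as np

def lin_ts(action_set_fn, reward_fn, T, d, inflation=1.0, lam=1.0, noise_sd=1.0):
    """Run LinTS for T rounds; `inflation` scales the posterior standard deviation."""
    V = lam * np.eye(d)   # regularized Gram matrix
    b = np.zeros(d)       # accumulated reward-weighted features
    for t in range(T):
        V_inv = np.linalg.inv(V)
        theta_hat = V_inv @ b                               # ridge estimate
        cov = (inflation * noise_sd) ** 2 * V_inv           # (inflated) posterior covariance
        theta_tilde = np.random.multivariate_normal(theta_hat, cov)  # posterior sample
        A_t = action_set_fn(t)                              # current action set, shape (K, d)
        a = A_t[np.argmax(A_t @ theta_tilde)]               # act greedily w.r.t. the sample
        r = reward_fn(a)                                    # observe noisy reward
        V += np.outer(a, a)
        b += r * a
    return V, b

# Toy usage: fixed action set of K unit vectors, linear rewards with Gaussian noise.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, K = 5, 20
    theta_star = rng.normal(size=d); theta_star /= np.linalg.norm(theta_star)
    actions = rng.normal(size=(K, d))
    actions /= np.linalg.norm(actions, axis=1, keepdims=True)
    lin_ts(lambda t: actions,
           lambda a: a @ theta_star + 0.1 * rng.normal(),
           T=500, d=d,
           inflation=np.sqrt(d))  # roughly the inflation scale used in prior frequentist analyses
```

Setting `inflation=1.0` corresponds to sampling from the unmodified posterior, the regime in which the randomization bias discussed above can cause linear regret.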