Bayesian bandit algorithms with approximate Bayesian inference have been widely used in real-world applications. However, there is a large discrepancy between the superior practical performance of these approaches and their theoretical justification. Previous research only provides a negative theoretical result: Thompson sampling can incur a worst-case linear regret $\Omega(T)$ when the inference error, measured by one $\alpha$-divergence, is bounded by a constant threshold. To bridge this gap, we propose an Enhanced Bayesian Upper Confidence Bound (EBUCB) framework that can efficiently accommodate bandit problems in the presence of approximate inference. Our theoretical analysis demonstrates that for Bernoulli multi-armed bandits, EBUCB achieves the optimal regret order $O(\log T)$ if the inference error measured by two different $\alpha$-divergences is less than a constant, regardless of how large this constant is. To the best of our knowledge, our study provides the first theoretical regret bound better than $o(T)$ in the setting of constant approximate inference error. Furthermore, in concordance with the negative results in previous studies, we show that one bounded $\alpha$-divergence alone is insufficient to guarantee a sub-linear regret.
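To make the setting concrete, the following is a minimal, hypothetical sketch (not the authors' EBUCB algorithm) of a quantile-based Bayesian UCB for a Bernoulli multi-armed bandit, where the posterior is deliberately perturbed to stand in for approximate Bayesian inference; the function name, the perturbation scheme, and the `noise` parameter are all illustrative assumptions.

```python
# Illustrative sketch only: Bayes-UCB on Bernoulli arms with a perturbed
# ("approximate") Beta posterior. Not the EBUCB algorithm from the paper.
import numpy as np
from scipy.stats import beta


def bayes_ucb_bernoulli(true_means, horizon, seed=0, noise=0.1):
    """Quantile-based Bayesian UCB with a jittered Beta posterior."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    successes = np.ones(n_arms)   # Beta(1, 1) prior counts
    failures = np.ones(n_arms)
    best_mean = max(true_means)
    cum_regret = 0.0

    for t in range(1, horizon + 1):
        # Standard Bayes-UCB quantile level 1 - 1/t.
        level = 1.0 - 1.0 / t
        # Hypothetical stand-in for approximate inference:
        # jitter the posterior parameters before computing the quantile.
        a = np.maximum(successes * (1.0 + noise * rng.standard_normal(n_arms)), 1e-3)
        b = np.maximum(failures * (1.0 + noise * rng.standard_normal(n_arms)), 1e-3)
        ucb = beta.ppf(level, a, b)

        arm = int(np.argmax(ucb))
        reward = rng.random() < true_means[arm]
        successes[arm] += reward
        failures[arm] += 1 - reward
        cum_regret += best_mean - true_means[arm]

    return cum_regret


if __name__ == "__main__":
    # Example run; with small inference error the regret grows slowly with T.
    print(bayes_ucb_bernoulli([0.3, 0.5, 0.7], horizon=5000))
```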