We study the behavior of Thompson sampling from the perspective of weak convergence. In the regime where the gaps between arm means scale as $1/\sqrt{n}$ with the time horizon $n$, we show that the dynamics of Thompson sampling evolve according to discrete versions of stochastic differential equations (SDEs) and random ordinary differential equations (ODEs). As $n \to \infty$, these dynamics converge weakly to solutions of the corresponding SDEs and random ODEs. (Recently, Wager and Xu (arXiv:2101.09855) independently proposed this regime and developed similar SDE and random ODE approximations.) Our weak convergence theory covers both the classical multi-armed bandit and linear bandit settings, and can be used, for instance, to gain insight into the characteristics of the regret distribution when there is information sharing among arms, as well as into the effects of variance estimation, model misspecification, and batched updates in bandit learning. Our theory is developed from first principles and can also be adapted to analyze other sampling-based bandit algorithms.
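To make the $1/\sqrt{n}$-gap scaling concrete, the following is a minimal simulation sketch, not taken from the paper: a two-armed Gaussian Thompson sampler with gap $c/\sqrt{n}$ and known unit reward variance, run to horizon $n$. The function name `thompson_regret`, the flat-prior Gaussian posterior update, and the choice $c = 2$ are our own illustrative assumptions. In this regime the cumulative regret is of order $\sqrt{n}$, so the normalized quantity regret$/\sqrt{n}$ should stabilize in distribution as $n$ grows, consistent with the diffusion-limit picture.

```python
import numpy as np

def thompson_regret(n, c, rng):
    """One run of two-armed Gaussian Thompson sampling over horizon n,
    with arm means (0, c/sqrt(n)) and unit-variance rewards.
    With a flat prior, the posterior for arm k after count_k pulls is
    N(empirical mean of arm k, 1/count_k)."""
    mu = np.array([0.0, c / np.sqrt(n)])   # gaps scale as 1/sqrt(n)
    counts = np.ones(2)                    # one warm-up pull per arm
    sums = rng.normal(mu, 1.0)             # rewards from the warm-up pulls
    regret = (mu.max() - mu).sum()         # regret of the two warm-up pulls
    for _ in range(n - 2):
        # Draw one posterior sample per arm and play the argmax.
        theta = rng.normal(sums / counts, 1.0 / np.sqrt(counts))
        a = int(theta.argmax())
        sums[a] += rng.normal(mu[a], 1.0)
        counts[a] += 1
        regret += mu.max() - mu[a]
    return regret

rng = np.random.default_rng(0)
for n in (1_000, 10_000, 100_000):
    # regret / sqrt(n) should have a roughly stable mean and spread
    # across horizons, reflecting the weak-convergence regime.
    r = np.array([thompson_regret(n, c=2.0, rng=rng) for _ in range(200)])
    print(n, (r / np.sqrt(n)).mean(), (r / np.sqrt(n)).std())
```

Plotting histograms of the normalized regret across horizons gives an empirical view of the limiting regret distribution that the SDE and random ODE approximations characterize analytically.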