We propose a new diffusion-asymptotic analysis for sequentially randomized experiments, including those that arise in solving multi-armed bandit problems. In an experiment with $n$ time steps, we let the mean reward gaps between actions scale to the order $1/\sqrt{n}$ so as to preserve the difficulty of the learning task as $n$ grows. In this regime, we show that the behavior of a class of sequentially randomized Markov experiments converges to a diffusion limit, given as the solution of a stochastic differential equation. The diffusion limit thus enables us to derive refined, instance-specific characterizations of the stochastic dynamics of adaptive experiments. As an application of this framework, we use the diffusion limit to obtain several new insights into the regret and belief evolution of Thompson sampling. We show that a version of Thompson sampling with an asymptotically uninformative prior variance achieves near-optimal, instance-specific regret scaling when the reward gaps are relatively large. We also demonstrate that, in this regime, the posterior beliefs underlying Thompson sampling are highly unstable over time.
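To make the scaling regime concrete, the following is a minimal illustrative sketch (not the paper's implementation) of Gaussian Thompson sampling on a two-armed bandit whose mean reward gap shrinks as $c/\sqrt{n}$ with the horizon $n$. The specific constants, the fixed prior variance, and the function name are assumptions made purely for illustration.

```python
# Illustrative sketch, assuming a two-armed Gaussian bandit with known noise
# variance and a conjugate Gaussian prior on each arm's mean. The gap between
# the arms is scaled to c / sqrt(n), matching the diffusion-asymptotic regime
# described above. Parameter choices are hypothetical, not from the paper.
import numpy as np


def thompson_sampling_diffusion_regime(n=10_000, c=2.0, noise_sd=1.0,
                                       prior_var=10.0, seed=0):
    rng = np.random.default_rng(seed)
    gap = c / np.sqrt(n)              # mean reward gap of order 1/sqrt(n)
    means = np.array([gap, 0.0])      # arm 0 is better by `gap`

    reward_sum = np.zeros(2)          # per-arm sufficient statistics
    pulls = np.zeros(2)
    regret = 0.0

    for _ in range(n):
        # Conjugate posterior for each arm's mean under a N(0, prior_var)
        # prior and known noise variance noise_sd**2.
        post_prec = 1.0 / prior_var + pulls / noise_sd**2
        post_var = 1.0 / post_prec
        post_mean = post_var * (reward_sum / noise_sd**2)

        # Thompson step: sample a mean for each arm, play the argmax.
        draws = rng.normal(post_mean, np.sqrt(post_var))
        arm = int(np.argmax(draws))

        reward = rng.normal(means[arm], noise_sd)
        reward_sum[arm] += reward
        pulls[arm] += 1
        regret += means.max() - means[arm]

    return regret, pulls


if __name__ == "__main__":
    regret, pulls = thompson_sampling_diffusion_regime()
    print(f"cumulative regret: {regret:.3f}, arm pulls: {pulls}")
```

Because the gap is only of order $1/\sqrt{n}$, the per-step regret is small but the identification problem stays hard as $n$ grows, which is the intuition behind keeping the learning task's difficulty fixed in the diffusion limit.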