We propose a new diffusion-asymptotic analysis for sequentially randomized experiments, including those that arise in solving multi-armed bandit problems. In an experiment with $n$ time steps, we let the mean reward gaps between actions scale to the order $1/\sqrt{n}$ so as to preserve the difficulty of the learning task as $n$ grows. In this regime, we show that the behavior of a class of sequentially randomized Markov experiments converges to a diffusion limit, given as the solution to a stochastic differential equation. The diffusion limit thus enables us to derive refined, instance-specific characterizations of the stochastic dynamics of sequential experiments. We use the diffusion limit to obtain several new insights into the regret and belief evolution of sequential experiments, including Thompson sampling. On the one hand, we show that all sequential experiments whose randomization probabilities have a Lipschitz-continuous dependence on the observed data suffer from sub-optimal regret performance when the reward gaps are relatively large. On the other hand, we find that a version of Thompson sampling with an asymptotically uninformative prior variance achieves near-optimal instance-specific regret scaling, including when the reward gaps are large. However, although the use of uninformative priors for Thompson sampling yields good regret properties, we show that the induced posterior beliefs are highly unstable over time.
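The experimental regime described above can be illustrated with a minimal simulation: a two-armed Gaussian bandit whose mean reward gap is set to $c/\sqrt{n}$, run with conjugate-Gaussian Thompson sampling. This is an illustrative sketch only, not the paper's formal setup; the function name `thompson_regret`, the choice of prior variance $\sqrt{n}$ as a stand-in for an "asymptotically uninformative" prior, and the unit reward variance are all assumptions made for the example.

```python
import numpy as np

def thompson_regret(n, c=1.0, prior_var=None, seed=0):
    """Two-armed Gaussian Thompson sampling with mean reward gap c/sqrt(n).

    Arm 0 has mean c/sqrt(n), arm 1 has mean 0; rewards have unit variance.
    prior_var: prior variance on each arm's mean; a large value growing with n
    serves here as an illustrative stand-in for an uninformative prior.
    Returns cumulative regret (pulls of the inferior arm times the gap).
    """
    rng = np.random.default_rng(seed)
    gap = c / np.sqrt(n)          # diffusion scaling of the reward gap
    means = np.array([gap, 0.0])
    if prior_var is None:
        prior_var = np.sqrt(n)    # assumed "asymptotically uninformative" choice
    # Normal-Normal conjugate updates with known unit observation variance:
    # posterior precision = prior precision + number of pulls,
    # posterior mean = (sum of rewards) / posterior precision (prior mean 0).
    prec = np.full(2, 1.0 / prior_var)
    reward_sum = np.zeros(2)
    pulls = np.zeros(2, dtype=int)
    for _ in range(n):
        post_mean = reward_sum / prec
        post_sd = np.sqrt(1.0 / prec)
        a = int(np.argmax(rng.normal(post_mean, post_sd)))  # sample, then act greedily
        r = rng.normal(means[a], 1.0)
        prec[a] += 1.0
        reward_sum[a] += r
        pulls[a] += 1
    return pulls[1] * gap
```

Because the gap shrinks as $1/\sqrt{n}$, the worst-case cumulative regret is bounded by $n \cdot c/\sqrt{n} = c\sqrt{n}$, so the per-instance difficulty does not vanish as $n$ grows.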