Multi-armed bandit algorithms like Thompson Sampling (TS) can be used to conduct adaptive experiments, in which maximizing reward means that data is used to progressively assign participants to more effective arms. Such assignment strategies increase the risk of statistical hypothesis tests identifying a difference between arms when there is not one, and failing to conclude there is a difference between arms when there truly is one. We tackle this by introducing a novel heuristic algorithm, called TS-PostDiff (Posterior Probability of Difference). TS-PostDiff takes a Bayesian approach to mixing TS and Uniform Random (UR): the probability that a participant is assigned using UR allocation is the posterior probability that the difference between two arms is 'small' (below a certain threshold), allowing for more UR exploration when there is little or no reward to be gained. We evaluate TS-PostDiff against state-of-the-art strategies. The empirical and simulation results help characterize the trade-offs of these approaches between reward, False Positive Rate (FPR), and statistical power, as well as the circumstances under which each is effective. We quantify the advantage of TS-PostDiff in performing well across multiple differences in arm means (effect sizes), showing the benefits of adaptively changing randomization/exploration in TS in a "Statistically Considerate" manner: reducing FPR and increasing statistical power when differences are small or zero and there is less reward to be gained, while exploiting more when differences may be large. This highlights important considerations for future algorithm development and analysis to better balance reward and statistical analysis.
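The mixing rule described above can be sketched for a two-armed Bernoulli bandit with Beta posteriors. This is a minimal illustration, not the paper's implementation: the 'small difference' threshold, the flat Beta(1, 1) priors, and the Monte Carlo estimate of the posterior probability are all assumptions made for the example.

```python
import random

def ts_postdiff_assign(successes, failures, threshold=0.1, n_samples=2000, rng=None):
    """One TS-PostDiff assignment for a two-armed Bernoulli bandit.

    successes/failures: per-arm counts; each arm gets a Beta(1+s, 1+f)
    posterior (flat prior assumed for illustration).
    threshold: the 'small difference' cutoff (an assumed value).
    """
    rng = rng or random.Random()
    # Estimate the posterior probability P(|p0 - p1| < threshold)
    # by Monte Carlo sampling from the two Beta posteriors.
    close = 0
    for _ in range(n_samples):
        p0 = rng.betavariate(1 + successes[0], 1 + failures[0])
        p1 = rng.betavariate(1 + successes[1], 1 + failures[1])
        if abs(p0 - p1) < threshold:
            close += 1
    post_diff = close / n_samples
    # With probability post_diff, fall back to Uniform Random allocation.
    if rng.random() < post_diff:
        return rng.randrange(2)
    # Otherwise use Thompson Sampling: one posterior draw per arm, pick the argmax.
    draws = [rng.betavariate(1 + successes[a], 1 + failures[a]) for a in (0, 1)]
    return 0 if draws[0] >= draws[1] else 1
```

When the two posteriors overlap heavily (early in the experiment, or when the true arm means are close), `post_diff` is near 1 and most assignments are uniform random, which preserves statistical power; as evidence of a large difference accumulates, `post_diff` shrinks and the policy behaves like standard TS.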