Multi-armed bandit algorithms like Thompson Sampling can be used to conduct adaptive experiments, in which maximizing reward means that data are used to progressively assign more participants to more effective arms. Such assignment strategies increase the risk that statistical hypothesis tests identify a difference between arms when there is not one, and fail to conclude there is a difference between arms when there truly is one. We present simulations for 2-arm experiments that explore two algorithms combining the benefits of uniform randomization for statistical analysis with the benefits of reward maximization achieved by Thompson Sampling (TS). First, Top-Two Thompson Sampling adds a fixed amount of uniform random (UR) allocation spread evenly over time. Second, we introduce a novel heuristic algorithm, TS PostDiff (Posterior Probability of Difference). TS PostDiff takes a Bayesian approach to mixing TS and UR: the probability that a participant is assigned using UR allocation is the posterior probability that the difference between the two arms is `small' (below a certain threshold), allowing for more UR exploration when there is little or no reward to be gained. We find that the TS PostDiff method performs well across multiple effect sizes, and thus does not require tuning based on a guess for the true effect size.
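To make the TS PostDiff mixing rule concrete, below is a minimal sketch of the allocation step for a 2-arm Bernoulli bandit with Beta(1,1) priors. The threshold value `c`, the Monte Carlo estimation of the posterior probability, and all function and parameter names are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np


def ts_postdiff_choose_arm(successes, failures, c=0.1, n_mc=10_000, rng=None):
    """Choose the next participant's arm with a TS PostDiff-style rule (sketch).

    successes, failures: length-2 arrays of observed counts per arm.
    c: assumed threshold below which the arm difference counts as 'small'.
    Returns the index of the chosen arm (0 or 1).
    """
    rng = rng or np.random.default_rng()

    # Posterior draws of each arm's success probability under a Beta-Bernoulli model.
    draws = rng.beta(1 + successes, 1 + failures, size=(n_mc, 2))

    # Monte Carlo estimate of P(|p_1 - p_2| < c | data): the probability of UR allocation.
    post_prob_small_diff = np.mean(np.abs(draws[:, 0] - draws[:, 1]) < c)

    if rng.random() < post_prob_small_diff:
        # Uniform random (UR) allocation: explore when the arms look similar.
        return int(rng.integers(2))

    # Otherwise take a Thompson Sampling step: one posterior draw per arm, pick the larger.
    ts_draw = rng.beta(1 + successes, 1 + failures)
    return int(np.argmax(ts_draw))


# Example usage: arm 1 looks better, so UR allocation becomes less likely over time.
arm = ts_postdiff_choose_arm(np.array([10, 18]), np.array([15, 7]))
```

The key design choice illustrated here is that the UR mixing weight is not fixed: it shrinks automatically as the posterior concentrates on a difference larger than the threshold, and stays high when the arms appear indistinguishable.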