Standard bandit algorithms assume continual reallocation of measurement effort, which is challenging to implement due to delayed feedback and infrastructural and organizational difficulties. Motivated by practical instances involving a handful of reallocation epochs in which outcomes are measured in batches, we develop a new adaptive experimentation framework that can flexibly handle any batch size. Our main observation is that normal approximations, which are universal in statistical inference, can also guide the design of scalable adaptive designs. By deriving an asymptotic sequential experiment, we formulate a dynamic program that can leverage prior information on average rewards. We propose a simple iterative planning method, Residual Horizon Optimization, which selects sampling allocations by optimizing a planning objective with stochastic gradient descent. Our method significantly improves statistical power over standard adaptive policies, even when compared to Bayesian bandit algorithms (e.g., Thompson sampling) that require full distributional knowledge of individual rewards. Overall, we expand the scope of adaptive experimentation to settings that are difficult for standard adaptive policies, including problems with a small number of reallocation epochs, low signal-to-noise ratio, and unknown reward distributions.
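To make the planning idea concrete, the sketch below illustrates a single-batch version of the kind of normal-approximation planning objective described above: an allocation over arms is chosen by running stochastic gradient descent on a Monte Carlo estimate of the expected best posterior mean after the next batch. This is not the authors' Residual Horizon Optimization implementation; the conjugate Gaussian update, the softmax parameterization, and all names (`plan_allocation`, `prior_mean`, `noise_var`, etc.) are illustrative assumptions.

```python
# Minimal illustrative sketch (hypothetical, not the paper's code): choose a
# sampling allocation for the next batch by SGD on a Gaussian planning objective.
import torch

def plan_allocation(prior_mean, prior_var, noise_var, batch_size,
                    n_steps=500, n_mc=256, lr=0.05, seed=0):
    """Return sampling proportions over arms for the next batch."""
    torch.manual_seed(seed)
    mu = torch.as_tensor(prior_mean, dtype=torch.float32)
    sigma2 = torch.as_tensor(prior_var, dtype=torch.float32)
    s2 = torch.as_tensor(noise_var, dtype=torch.float32)
    logits = torch.zeros_like(mu, requires_grad=True)  # softmax parameterization
    opt = torch.optim.SGD([logits], lr=lr)

    for _ in range(n_steps):
        p = torch.softmax(logits, dim=0)
        # Variance of the updated posterior mean if arm k receives n * p_k samples
        # (normal-normal conjugate update; a modeling assumption in this sketch).
        post_mean_var = sigma2 ** 2 / (sigma2 + s2 / (batch_size * p))
        # Reparameterized Monte Carlo draws of next-batch posterior means.
        z = torch.randn(n_mc, mu.shape[0])
        post_means = mu + torch.sqrt(post_mean_var) * z
        # Planning objective: expected value of the best arm after the batch.
        objective = post_means.max(dim=1).values.mean()
        opt.zero_grad()
        (-objective).backward()
        opt.step()

    return torch.softmax(logits, dim=0).detach()

# Example usage: 4 arms, weak priors, noisy rewards, a batch of 1,000 measurements.
allocation = plan_allocation(prior_mean=[0.0, 0.1, 0.05, -0.02],
                             prior_var=[1.0] * 4,
                             noise_var=[25.0] * 4,
                             batch_size=1000)
print(allocation)  # sampling proportions for the next reallocation epoch
```

Because the objective depends on the allocation only through the batch-level normal approximation, the optimization needs no per-observation reward model, which is what lets this style of planning scale to large batch sizes.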