Standard bandit algorithms that assume continual reallocation of measurement effort are challenging to implement due to delayed feedback and infrastructural/organizational difficulties. Motivated by practical instances involving a handful of reallocation epochs in which outcomes are measured in batches, we develop a new adaptive experimentation framework that can flexibly handle any batch size. Our main observation is that the normal approximations universal in statistical inference can also guide the design of scalable adaptive experiments. By deriving an asymptotic sequential experiment, we formulate a dynamic program that can leverage prior information on average rewards. State transitions of the dynamic program are differentiable with respect to the sampling allocations, allowing the use of gradient-based methods for planning and policy optimization. We propose a simple iterative planning method, Residual Horizon Optimization, which selects sampling allocations by optimizing a planning objective via stochastic gradient methods. Our method significantly improves statistical power over standard adaptive policies, even when compared to Bayesian bandit algorithms (e.g., Thompson sampling) that require full distributional knowledge of individual rewards. Overall, we expand the scope of adaptive experimentation to settings that are difficult for standard adaptive policies, including problems with a small number of reallocation epochs, low signal-to-noise ratios, and unknown reward distributions.
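To illustrate the core idea of differentiating a planning objective with respect to sampling allocations, the sketch below runs gradient descent on a simplex-parameterized allocation under a Gaussian approximation. This is a hypothetical, simplified surrogate, not the paper's Residual Horizon Optimization: it minimizes the total estimator variance \(\sum_k \sigma_k^2 / (n\, p_k)\) for one batch of size `n_batch`, whose closed-form optimum is the Neyman allocation \(p_k \propto \sigma_k\). The function name and hyperparameters are illustrative assumptions.

```python
import numpy as np

def plan_allocation(sigmas, n_batch, steps=3000, lr=0.1):
    """Pick a sampling allocation over K arms by gradient descent on a
    Gaussian-approximation planning objective.

    Simplified surrogate objective (NOT the paper's actual one):
        f(p) = sum_k sigmas[k]**2 / (n_batch * p[k]),
    i.e., total variance of the batch's arm-mean estimates. Its minimizer
    on the simplex is the Neyman allocation p_k proportional to sigmas[k].
    """
    sigmas = np.asarray(sigmas, dtype=float)
    # Parameterize the simplex with logits so every gradient step stays feasible.
    logits = np.zeros(len(sigmas))
    for _ in range(steps):
        p = np.exp(logits) / np.exp(logits).sum()   # softmax onto the simplex
        grad_p = -sigmas**2 / (n_batch * p**2)      # df / dp_k
        # Chain rule through the softmax Jacobian: J[i, j] = p_i (1{i=j} - p_j)
        grad_logits = p * (grad_p - np.dot(p, grad_p))
        logits -= lr * grad_logits
    return np.exp(logits) / np.exp(logits).sum()

# With noise scales (1, 2), the optimum is (1/3, 2/3).
alloc = plan_allocation(sigmas=[1.0, 2.0], n_batch=100)
```

In the paper's setting the objective is instead an expected-reward criterion evaluated through the (differentiable) Gaussian state transitions of the dynamic program, and the allocation is re-planned at every reallocation epoch; the simplex parameterization and gradient loop shown here carry over unchanged.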