Online reinforcement learning and other adaptive sampling algorithms are increasingly used in digital intervention experiments to optimize treatment delivery for users over time. In this work, we focus on longitudinal user data collected by a large class of adaptive sampling algorithms that are designed to optimize treatment decisions online using accruing data from multiple users. Combining or "pooling" data across users allows adaptive sampling algorithms to potentially learn faster. However, by pooling, these algorithms induce dependence between the sampled user data trajectories; we show that this can cause standard variance estimators for i.i.d. data to underestimate the true variance of common estimators on this data type. We develop novel methods to perform a variety of statistical analyses on such adaptively sampled data via Z-estimation. Specifically, we introduce the \textit{adaptive} sandwich variance estimator, a corrected sandwich estimator that leads to consistent variance estimates under adaptive sampling. Additionally, to prove our results we develop novel theoretical tools for empirical processes on non-i.i.d., adaptively sampled longitudinal data which may be of independent interest. This work is motivated by our efforts in designing experiments in which online reinforcement learning algorithms optimize treatment decisions, yet statistical inference is essential for conducting analyses after experiments conclude.
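For orientation, below is a minimal sketch of the classical i.i.d. sandwich variance form that the adaptive sandwich estimator corrects; the notation ($\psi$, $D_i$, $\hat\theta_n$) is illustrative and not taken from the paper.

% Sketch of the classical i.i.d. sandwich variance (illustrative notation only).
% A Z-estimator $\hat\theta_n$ solves $\sum_{i=1}^n \psi(D_i; \hat\theta_n) = 0$,
% and under i.i.d. sampling of the trajectories $D_1, \dots, D_n$ the asymptotic
% variance of $\sqrt{n}(\hat\theta_n - \theta_0)$ is estimated by
\[
  \hat A_n^{-1} \hat B_n \hat A_n^{-\top},
  \qquad
  \hat A_n = \frac{1}{n} \sum_{i=1}^n \dot\psi(D_i; \hat\theta_n),
  \qquad
  \hat B_n = \frac{1}{n} \sum_{i=1}^n \psi(D_i; \hat\theta_n)\,\psi(D_i; \hat\theta_n)^{\top}.
\]
% Under pooled adaptive sampling the trajectories are no longer independent, so
% this i.i.d. form can understate the true variance; the adaptive sandwich
% estimator introduced in this work corrects for that dependence.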