Online reinforcement learning and other adaptive sampling algorithms are increasingly used in digital intervention experiments to optimize treatment delivery for users over time. In this work, we focus on longitudinal user data collected by a large class of adaptive sampling algorithms that are designed to optimize treatment decisions online using accruing data from multiple users. Combining or "pooling" data across users can allow adaptive sampling algorithms to learn faster. However, by pooling, these algorithms induce dependence between the collected user data trajectories; we show that this can cause standard variance estimators for i.i.d. data to underestimate the true variance of common estimators applied to such data. We develop novel methods to perform a variety of statistical analyses on such adaptively collected data via Z-estimation. Specifically, we introduce the adaptive sandwich variance estimator, a corrected sandwich estimator that yields consistent variance estimates under adaptive sampling. Additionally, to prove our results we develop significant theory for empirical processes on non-i.i.d., adaptively collected, longitudinal data. This work is motivated by our efforts to design digital intervention experiments in which online reinforcement learning algorithms pool data across users to learn to optimize treatment decisions, and in which reliable statistical inference is essential for analyzing the data after the experiment is over.
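As background for the abstract's claim, a minimal sketch of the standard i.i.d. sandwich variance estimator, in generic notation not taken from this paper: let $\hat\theta_n$ be a Z-estimator solving $\sum_{i=1}^{n} \psi(X_i; \hat\theta_n) = 0$ over $n$ user trajectories $X_1, \dots, X_n$, where $\psi$ is an estimating function and $\dot\psi$ its derivative in $\theta$. The usual variance estimate is
$$
\widehat{\operatorname{Var}}(\hat\theta_n) = \frac{1}{n}\,\hat{B}_n^{-1}\,\hat{M}_n\,\hat{B}_n^{-\top},
\qquad
\hat{B}_n = \frac{1}{n}\sum_{i=1}^{n} \dot\psi(X_i; \hat\theta_n),
\qquad
\hat{M}_n = \frac{1}{n}\sum_{i=1}^{n} \psi(X_i; \hat\theta_n)\,\psi(X_i; \hat\theta_n)^{\top}.
$$
The middle term $\hat{M}_n$ implicitly treats the user trajectories as independent; when an adaptive sampling algorithm pools data across users, this independence fails and the resulting variance estimates can understate the true variance, which is the issue the adaptive sandwich variance estimator introduced in this work is designed to correct.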