Online reinforcement learning and other adaptive sampling algorithms are increasingly used in digital intervention experiments to optimize treatment delivery for users over time. In this work, we focus on longitudinal user data collected by a large class of adaptive sampling algorithms that are designed to optimize treatment decisions online using accruing data from multiple users. Combining, or "pooling," data across users can allow adaptive sampling algorithms to learn faster. However, pooling also induces dependence between the collected user data trajectories; we show that this can cause standard variance estimators for i.i.d. data to underestimate the true variance of common estimators on this data type. We develop novel methods to perform a variety of statistical analyses on such adaptively collected data via Z-estimation. Specifically, we introduce the adaptive sandwich variance estimator, a corrected sandwich estimator that yields consistent variance estimates under adaptive sampling. Additionally, to prove our results, we develop novel theoretical tools for empirical processes on adaptively collected longitudinal data, which may be of independent interest. This work is motivated by our efforts to design digital intervention experiments in which online reinforcement learning algorithms optimize treatment decisions during the study, yet valid statistical inference is essential for analyses conducted after the experiment concludes.
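For context, here is a minimal sketch of the classical i.i.d. sandwich variance that the adaptive sandwich estimator corrects; the notation ($W_i$ for user $i$'s trajectory, $\psi$ for the estimating function, $\hat\theta$ for the Z-estimator) is standard Z-estimation notation and is not taken from the paper itself. With $n$ independent trajectories, $\hat\theta$ solves $\sum_{i=1}^{n} \psi(W_i; \hat\theta) = 0$, and the usual variance estimate is
\[
\widehat{\mathrm{Var}}(\hat\theta) = \frac{1}{n}\, \hat A_n^{-1}\, \hat B_n\, \hat A_n^{-\top},
\qquad
\hat A_n = \frac{1}{n}\sum_{i=1}^{n} \partial_\theta\, \psi(W_i; \hat\theta),
\quad
\hat B_n = \frac{1}{n}\sum_{i=1}^{n} \psi(W_i; \hat\theta)\, \psi(W_i; \hat\theta)^\top.
\]
When the sampling algorithm pools data across users, the trajectories $W_1, \dots, W_n$ are dependent, so $\hat B_n$ omits the cross-trajectory covariances $\mathrm{Cov}\big(\psi(W_i; \theta), \psi(W_j; \theta)\big)$ for $i \neq j$; this is one heuristic way to see how the i.i.d. sandwich can underestimate the true variance, which is the failure the adaptive sandwich variance estimator is designed to repair.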