There is a great desire to use adaptive sampling methods, such as reinforcement learning (RL) and bandit algorithms, for the real-time personalization of interventions in digital applications like mobile health and education. A major obstacle preventing more widespread use of such algorithms in practice is the lack of assurance that the resulting adaptively collected data can be used to reliably answer inferential questions, including questions about time-varying causal effects. Current methods for statistical inference on such data are insufficient because they (a) make strong assumptions regarding the environment dynamics, e.g., assume a contextual bandit or Markovian environment, or (b) require the data to be collected with one adaptive sampling algorithm per user, which excludes data collected by algorithms that learn to select actions by pooling the data of multiple users. In this work, we make initial progress by introducing the adaptive sandwich estimator to quantify uncertainty; this estimator (a) is valid even when user rewards and contexts are non-stationary and highly dependent over time, and (b) accommodates settings in which an online adaptive sampling algorithm learns using the data of all users. Furthermore, our inference method is robust to misspecification of the reward models used by the adaptive sampling algorithm. This work is motivated by our experience designing experiments in which RL algorithms are used to select actions, yet reliable statistical inference is essential for conducting primary analyses after the trial is over.
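To fix ideas, the sketch below shows the classical (non-adaptive) sandwich variance estimator for a least-squares estimating equation; it is standard background only, and the function and variable names are illustrative. The adaptive sandwich estimator described above additionally accounts for the action probabilities produced by the online sampling algorithm, which is not reflected in this sketch.

```python
# Minimal sketch of the classical sandwich (robust) variance estimator for OLS.
# This is generic background, not the paper's adaptive sandwich estimator:
# it does not adjust for adaptively chosen action probabilities.
import numpy as np

def sandwich_variance(X, y, theta_hat):
    """Robust covariance A^{-1} B A^{-T} / n for the estimating function
    psi_i(theta) = x_i * (y_i - x_i' theta)."""
    n = X.shape[0]
    resid = y - X @ theta_hat                  # residuals at the fitted value
    A = X.T @ X / n                            # "bread": average Jacobian of -psi
    scores = X * resid[:, None]                # per-observation estimating functions
    B = scores.T @ scores / n                  # "meat": average outer product of psi
    A_inv = np.linalg.inv(A)
    return A_inv @ B @ A_inv.T / n

# Usage: fit by ordinary least squares, then form robust standard errors.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=500)
theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
se = np.sqrt(np.diag(sandwich_variance(X, y, theta_hat)))
print(theta_hat, se)
```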