Learning optimal policies from historical data enables personalization across a wide range of applications, including healthcare, digital recommendations, and online education. The growing policy learning literature focuses on settings where the data collection rule stays fixed throughout the experiment. However, adaptive data collection is becoming more common in practice, arising from two primary sources: (1) data collected from adaptive experiments that are designed to improve inferential efficiency; and (2) data collected from production systems that progressively evolve an operational policy to improve performance over time (e.g., contextual bandits). Yet adaptivity complicates ex post identification of the optimal policy, since samples are dependent and each treatment may not receive enough observations for each type of individual. In this paper, we take initial steps toward addressing the challenges of learning the optimal policy with adaptively collected data. We propose an algorithm based on generalized augmented inverse propensity weighted (AIPW) estimators, which non-uniformly reweight the elements of a standard AIPW estimator to control worst-case estimation variance. We establish a finite-sample regret upper bound for our algorithm and complement it with a regret lower bound that quantifies the fundamental difficulty of policy learning with adaptive data. When equipped with the best weighting scheme, our algorithm achieves minimax rate-optimal regret guarantees even with diminishing exploration. Finally, we demonstrate our algorithm's effectiveness using both synthetic data and public benchmark datasets.
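To make the reweighting idea concrete, the following is a minimal sketch of a weighted AIPW policy-value estimate and a search over a finite set of candidate policies. It is an illustration under stated assumptions, not the paper's algorithm: the function names (`aipw_scores`, `weighted_policy_value`, `learn_policy`), the normalized weighted-average form of the estimator, and the finite candidate set are all hypothetical. The per-round weights are taken as an input, because the abstract does not specify the weighting scheme; uniform weights give a plain average of standard AIPW scores.

```python
import numpy as np


def aipw_scores(actions, rewards, propensities, mu_hat):
    """Standard AIPW scores for every round and every arm, shape (T, K).

    actions:      (T,) int array, arm pulled in each round
    rewards:      (T,) float array, observed outcome in each round
    propensities: (T, K) array, logged assignment probabilities e_t(x_t, a)
    mu_hat:       (T, K) array, outcome-model predictions (assumed to be
                  fitted elsewhere, e.g. on past rounds only)
    """
    T, K = mu_hat.shape
    taken = np.zeros((T, K))
    taken[np.arange(T), actions] = 1.0
    residual = rewards[:, None] - mu_hat
    # Model prediction plus an inverse-propensity-weighted correction
    # that is nonzero only for the arm actually pulled.
    return mu_hat + taken / propensities * residual


def weighted_policy_value(scores, policy_actions, weights):
    """Weighted average of the AIPW scores of the arms a policy would pick.

    weights: (T,) nonnegative per-round weights (an assumption of this
             sketch; uniform weights give the plain AIPW average).
    """
    T = scores.shape[0]
    picked = scores[np.arange(T), policy_actions]
    return float(np.sum(weights * picked) / np.sum(weights))


def learn_policy(scores, candidate_actions, weights):
    """Index of the candidate policy with the highest estimated value.

    candidate_actions: (P, T) int array; row p lists the arms that
                       candidate policy p would choose in each round.
    """
    values = [weighted_policy_value(scores, acts, weights)
              for acts in candidate_actions]
    return int(np.argmax(values))
```

A caller would supply weights that put less mass on rounds with small propensities, which is one way to limit the variance contributed by rounds with little exploration; how to choose those weights well is the subject of the paper and is not reproduced here.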