Learning optimal policies from historical data enables the gains from personalization to be realized in a wide variety of applications. The growing policy learning literature focuses on settings where the treatment assignment policy is fixed in advance and does not adapt to the data. However, adaptive data collection is becoming increasingly common in practice, from two primary sources: (1) data collected from adaptive experiments designed to improve inferential efficiency; (2) data collected from production systems that adaptively evolve an operational policy to improve performance over time (e.g., contextual bandits). In this paper, we aim to address the challenge of learning the optimal policy from adaptively collected data and provide one of the first theoretical inquiries into this problem. We propose an algorithm based on generalized augmented inverse propensity weighted (AIPW) estimators and establish its finite-sample regret bound. We complement this regret upper bound with a lower bound that characterizes the fundamental difficulty of policy learning with adaptive data. Finally, we demonstrate our algorithm's effectiveness using both synthetic data and public benchmark datasets.
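For orientation, a minimal sketch of the standard (non-generalized) AIPW approach to offline policy learning follows; the notation below (contexts $X_t$, actions $A_t$, outcomes $Y_t$, assignment probabilities $e_t$, and outcome model $\hat{\mu}_t$) is assumed here and is not defined in this abstract, and the paper's generalized estimator additionally reweights these scores to account for adaptivity.
\[
\hat{\Gamma}_t(a) \;=\; \hat{\mu}_t(X_t, a) \;+\; \frac{\mathbf{1}\{A_t = a\}}{e_t(X_t, a)}\bigl(Y_t - \hat{\mu}_t(X_t, a)\bigr),
\qquad
\hat{\pi} \;=\; \arg\max_{\pi \in \Pi}\; \frac{1}{T}\sum_{t=1}^{T} \hat{\Gamma}_t\bigl(\pi(X_t)\bigr).
\]
The doubly robust structure of the score $\hat{\Gamma}_t$ is what makes this family attractive: it remains consistent if either the propensity model or the outcome model is well specified, which is the starting point for handling propensities that themselves change over the course of adaptive data collection.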