It has become increasingly common for data to be collected adaptively, for example using contextual bandits. Historical data of this type can be used to evaluate other treatment assignment policies to guide future innovation or experiments. However, policy evaluation is challenging if the target policy differs from the one used to collect data, and popular estimators, including doubly robust (DR) estimators, can be plagued by bias, excessive variance, or both. In particular, when the pattern of treatment assignment in the collected data looks little like the pattern generated by the policy to be evaluated, the importance weights used in DR estimators explode, leading to excessive variance. In this paper, we improve the DR estimator by adaptively weighting observations to control its variance. We show that a t-statistic based on our improved estimator is asymptotically normal under certain conditions, allowing us to form confidence intervals and test hypotheses. Using synthetic data and public benchmarks, we provide empirical evidence for our estimator's improved accuracy and inferential properties relative to existing alternatives.
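To make the role of the importance weights concrete, here is a minimal sketch of the standard doubly robust score for logged bandit data. It is not the estimator proposed in the paper. The function names, array layouts, and the `weights` argument are assumptions made for illustration. The abstract does not specify the adaptive weights, so `weights` is only a placeholder for whatever per-observation weighting is used; uniform weights give the usual DR average.

```python
import numpy as np


def dr_scores(actions, rewards, logging_probs, target_probs, mu_hat):
    """Per-observation doubly robust scores for a target policy.

    actions:       (T,) int array of logged arms
    rewards:       (T,) float array of observed rewards
    logging_probs: (T, K) probabilities the data-collection policy gave each arm
    target_probs:  (T, K) probabilities the target policy gives each arm
    mu_hat:        (T, K) outcome-model predictions for each arm
    """
    idx = np.arange(len(actions))
    # Importance weight on the logged arm: target probability / logging probability.
    # This ratio grows large when the logging policy rarely chose the arm.
    iw = target_probs[idx, actions] / logging_probs[idx, actions]
    # Direct-method term: model predictions averaged under the target policy.
    direct = (target_probs * mu_hat).sum(axis=1)
    # Correction term: importance-weighted residual on the logged arm.
    correction = iw * (rewards - mu_hat[idx, actions])
    return direct + correction


def weighted_estimate(scores, weights=None):
    """Normalized weighted average of scores (uniform weights if none given).

    `weights` is a hypothetical hook for per-observation weights.
    """
    scores = np.asarray(scores, dtype=float)
    if weights is None:
        weights = np.ones_like(scores)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * scores) / np.sum(weights))
```

In this sketch, a logging probability near zero on the logged arm inflates `iw` and therefore the spread of the scores, which is the variance problem the abstract describes.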