Off-policy evaluation (OPE) estimates a target policy's value from a historical dataset generated by a different behavior policy. Beyond a point estimate, many applications would benefit significantly from a confidence interval (CI) that quantifies the uncertainty of that estimate. In this paper, we propose a novel procedure to construct an efficient, robust, and flexible CI on a target policy's value. Our method is justified by theoretical results and numerical experiments. A Python implementation of the proposed procedure is available at https://github.com/RunzheStat/D2OPE.
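To make the setting concrete, the sketch below shows the standard importance-sampling baseline for OPE with a simple normal-approximation CI. It is not the procedure proposed in this paper, and all quantities in it are simulated placeholders; it only illustrates what "a point estimate plus a CI on a target policy's value" means.

```python
import numpy as np

# Hypothetical setup: n trajectories logged under a behavior policy, each
# with its probability under both policies and its cumulative return.
# These are simulated placeholders, not real OPE data.
rng = np.random.default_rng(0)
n = 500
behavior_probs = rng.uniform(0.2, 0.8, size=n)  # pi_b(a_i | s_i)
target_probs = rng.uniform(0.2, 0.8, size=n)    # pi_t(a_i | s_i)
returns = rng.normal(1.0, 0.5, size=n)          # cumulative rewards G_i

# Per-trajectory importance-weighted return: rho_i * G_i,
# where rho_i = pi_t(a_i | s_i) / pi_b(a_i | s_i).
weights = target_probs / behavior_probs
is_estimates = weights * returns

# Point estimate of the target policy's value.
value_hat = is_estimates.mean()

# 95% normal-approximation CI from the sample standard error.
se = is_estimates.std(ddof=1) / np.sqrt(n)
lo, hi = value_hat - 1.96 * se, value_hat + 1.96 * se
print(f"point estimate: {value_hat:.3f}, 95% CI: [{lo:.3f}, {hi:.3f}]")
```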