Off-policy evaluation is critical in applications where new policies must be evaluated offline before online deployment. Most existing methods focus on the expected return, defining the target parameter through averaging and providing only a point estimator. In this paper, we develop a novel procedure for producing reliable interval estimators of a target policy's return starting from any initial state. Our proposal accounts for the variability of the return around its expectation, focuses on individual effects, and offers valid uncertainty quantification. Our main idea is to design a pseudo policy that generates subsamples as if they were sampled from the target policy, so that existing conformal prediction algorithms can be applied to construct prediction intervals. Our method is justified theoretically and validated on both synthetic data and real data from short-video platforms.
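To make the conformal step concrete, below is a minimal sketch of split conformal prediction applied to such subsamples, assuming they are exchangeable with draws from the target policy (the role played by the pseudo-policy construction); the regression model, function names, and data split here are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of split conformal prediction, assuming we already hold
# subsamples (state features X, realized returns y) that are exchangeable
# with draws from the target policy. Model choice and names are
# illustrative only, not taken from the paper.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def conformal_interval(X_train, y_train, X_calib, y_calib, x_new, alpha=0.1):
    """(1 - alpha) prediction interval for the return starting at x_new."""
    model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
    # Conformity scores: absolute residuals on the held-out calibration split.
    scores = np.abs(y_calib - model.predict(X_calib))
    # Finite-sample-corrected empirical quantile of the scores.
    n = len(scores)
    q = np.quantile(scores, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))
    pred = model.predict(np.asarray(x_new).reshape(1, -1))[0]
    return pred - q, pred + q
```

Under the exchangeability assumption, such an interval covers the realized return with probability at least 1 - alpha in finite samples, which is the sense in which the uncertainty quantification is valid.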