Off-policy evaluation (OPE) is concerned with evaluating a new target policy using offline data generated by a potentially different behavior policy. It is critical in a number of sequential decision making problems, ranging from healthcare to technology industries. Most existing work focuses on evaluating the mean outcome of a given policy and ignores the variability of the outcome. However, in a variety of applications, criteria other than the mean may be more sensible. For example, when the reward distribution is skewed and asymmetric, quantile-based metrics are often preferred for their robustness. In this paper, we propose a doubly robust inference procedure for quantile OPE in sequential decision making and study its asymptotic properties. In particular, we propose utilizing state-of-the-art deep conditional generative learning methods to handle parameter-dependent nuisance function estimation. We demonstrate the advantages of the proposed estimator through both simulations and a real-world dataset from a short-video platform. Notably, we find that our proposed estimator outperforms classical OPE estimators for the mean in settings with heavy-tailed reward distributions.
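To illustrate the abstract's motivation for quantile-based metrics, the following is a minimal, hypothetical sketch (not the paper's doubly robust estimator): it compares how much the sample mean and a sample quantile fluctuate when rewards follow a heavy-tailed distribution. The reward model and all variable names here are illustrative assumptions.

```python
# Illustrative only: why a quantile can be a more stable evaluation target
# than the mean under heavy-tailed rewards (e.g., skewed watch-time data).
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical heavy-tailed rewards: Pareto tail with infinite variance,
# so the sample mean is driven by rare extreme values.
rewards = rng.pareto(a=1.5, size=10_000)

# Bootstrap resamples to measure the variability of each statistic.
means, medians = [], []
for _ in range(200):
    sample = rng.choice(rewards, size=rewards.size, replace=True)
    means.append(sample.mean())
    medians.append(np.quantile(sample, 0.5))  # 0.5-quantile (median)

print(f"mean:   {np.mean(means):.3f} +/- {np.std(means):.3f}")
print(f"median: {np.mean(medians):.3f} +/- {np.std(medians):.3f}")
```

Under these assumptions, the bootstrap spread of the median is much smaller than that of the mean, which is the robustness property the abstract appeals to when advocating quantile OPE.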