Off-policy evaluation (OPE) is a method for estimating the performance of decision-making policies using historical data generated by different policies, without conducting costly online A/B tests. Accurate OPE is essential in domains such as healthcare, marketing, and recommender systems to avoid deploying poorly performing policies, as such policies may harm human lives or destroy the user experience. Consequently, many theoretically grounded OPE methods have been proposed. An emerging challenge with this trend is that the most suitable estimator can differ from one application setting to another, and practitioners often do not know which estimator to use for their specific applications and purposes. To find a suitable estimator among many candidates, we use a data-driven procedure for selecting among off-policy performance estimators as a practical solution. As a proof of concept, we use this procedure to select the best estimator for evaluating coupon treatment policies on a real-world online content delivery service. In the experiment, we first observe that the suitable estimator may change with different definitions of the outcome variable, so accurate estimator selection is critical in real-world applications of OPE. We then demonstrate that, by using the estimator selection procedure, we can easily find suitable estimators for each purpose.
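The abstract does not spell out the selection procedure, so the following is only a minimal sketch under stated assumptions. It assumes logs are available from two policies that were both deployed (for example, the two arms of a past A/B test). Each candidate estimator then uses policy A's logs to estimate policy B's value, and the on-policy average reward observed under B serves as the ground truth. The synthetic environment, the hypothetical policies `PI_A` and `PI_B`, the three candidate estimators (IPS, SNIPS, and a direct method), and the bootstrap mean-squared-error criterion are illustrative choices, not necessarily the paper's exact method.

```python
import random
from statistics import mean

# Hypothetical synthetic bandit (an assumption for illustration only):
# 2 contexts, 3 actions, Bernoulli rewards with these success rates.
CONTEXTS = [0, 1]
ACTIONS = [0, 1, 2]
TRUE_P = {0: [0.10, 0.30, 0.20], 1: [0.40, 0.15, 0.25]}

# pi(a | x) for two hypothetical policies that were both deployed,
# e.g. the two arms of a past A/B test.
PI_A = {0: [0.6, 0.2, 0.2], 1: [0.2, 0.6, 0.2]}
PI_B = {0: [0.2, 0.6, 0.2], 1: [0.6, 0.2, 0.2]}


def collect(policy, n, rng):
    """Log n rounds of (context, action, reward) generated by `policy`."""
    logs = []
    for _ in range(n):
        x = rng.choice(CONTEXTS)
        a = rng.choices(ACTIONS, weights=policy[x])[0]
        r = 1.0 if rng.random() < TRUE_P[x][a] else 0.0
        logs.append((x, a, r))
    return logs


def ips(logs, behavior, target):
    """Inverse propensity scoring: importance-weighted mean reward."""
    return mean(target[x][a] / behavior[x][a] * r for x, a, r in logs)


def snips(logs, behavior, target):
    """Self-normalized IPS: divides by the sum of importance weights."""
    w = [target[x][a] / behavior[x][a] for x, a, _ in logs]
    return sum(wi * r for wi, (_, _, r) in zip(w, logs)) / sum(w)


def dm(logs, behavior, target):
    """Direct method: plug a per-(context, action) mean-reward model into target."""
    sums, counts = {}, {}
    for x, a, r in logs:
        sums[(x, a)] = sums.get((x, a), 0.0) + r
        counts[(x, a)] = counts.get((x, a), 0) + 1
    q = {k: sums[k] / counts[k] for k in sums}  # unseen pairs default to 0 below
    return mean(sum(target[x][b] * q.get((x, b), 0.0) for b in ACTIONS)
                for x, _, _ in logs)


ESTIMATORS = {"IPS": ips, "SNIPS": snips, "DM": dm}


def select_estimator(logs_a, logs_b, n_boot=100, seed=0):
    """Return the estimator whose estimate of V(pi_B) from pi_A's logs has the
    smallest bootstrap MSE against pi_B's on-policy average reward."""
    rng = random.Random(seed)
    v_b = mean(r for _, _, r in logs_b)  # on-policy value, used as ground truth
    mse = {}
    for name, est in ESTIMATORS.items():
        errs = []
        for _ in range(n_boot):
            sample = rng.choices(logs_a, k=len(logs_a))  # bootstrap resample
            errs.append((est(sample, PI_A, PI_B) - v_b) ** 2)
        mse[name] = mean(errs)
    return min(mse, key=mse.get), mse


rng = random.Random(42)
logs_a = collect(PI_A, 2000, rng)
logs_b = collect(PI_B, 2000, rng)
best, mse = select_estimator(logs_a, logs_b)
print(best, {k: round(v, 5) for k, v in mse.items()})
```

Because the outcome definition enters this sketch only through the reward `r`, rerunning `select_estimator` on logs built with a different reward definition can pick a different estimator, which mirrors the observation reported above.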