Off-policy evaluation (OPE), or offline evaluation in general, estimates the performance of hypothetical policies using only offline log data. It is particularly useful in applications where online interaction is high-stakes and expensive, such as precision medicine and recommender systems. Because many OPE estimators have been proposed, and some of them have hyperparameters that must be tuned, practitioners face an emerging challenge in selecting and tuning OPE estimators for their specific application. Unfortunately, identifying a reliable estimator from results reported in research papers is often difficult, because the current experimental procedure evaluates and compares estimators on only a narrow set of hyperparameters and evaluation policies. It is therefore hard to know which estimator is safe and reliable to use. In this work, we develop Interpretable Evaluation for Offline Evaluation (IEOE), an experimental procedure that evaluates OPE estimators' robustness to changes in hyperparameters and/or evaluation policies in an interpretable manner. Then, using the IEOE procedure, we perform an extensive evaluation of a wide variety of existing estimators on the Open Bandit Dataset, a large-scale public real-world dataset for OPE. We demonstrate that our procedure can evaluate the estimators' robustness to the choice of hyperparameters, helping us avoid unsafe estimators. Finally, we apply IEOE to data from a real-world e-commerce platform and demonstrate how to use our protocol in practice.
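As a rough illustration of the idea of testing an estimator across many hyperparameter and evaluation-policy settings rather than a single hand-picked one, the following minimal sketch may help. It is not the IEOE implementation: the synthetic context-free bandit, the choice of clipped inverse propensity scoring as the estimator, the sampling ranges, and all function names (`clipped_ips`, `random_policy`, `robustness_errors`, `empirical_cdf`) are assumptions made for this example only.

```python
import random


def clipped_ips(rewards, pi_e, pi_b, clip):
    """Inverse propensity scoring with importance weights clipped at `clip`."""
    weights = [min(e / b, clip) for e, b in zip(pi_e, pi_b)]
    return sum(w * r for w, r in zip(weights, rewards)) / len(rewards)


def random_policy(rng, n_actions, floor=0.0):
    """Random action distribution; `floor` mixes in a uniform policy."""
    raw = [rng.random() + 1e-9 for _ in range(n_actions)]
    total = sum(raw)
    return [(1.0 - floor) * x / total + floor / n_actions for x in raw]


def robustness_errors(n_trials=100, n_rounds=1000, n_actions=5, seed=0):
    """Squared errors of clipped IPS over randomly drawn settings.

    Each trial draws a new evaluation policy, a new clipping threshold
    (the hyperparameter), and a new sample of logged data.
    """
    rng = random.Random(seed)
    # Synthetic ground truth (assumption): one expected reward per action.
    expected_reward = [rng.uniform(0.1, 0.9) for _ in range(n_actions)]
    # Behavior (logging) policy, kept away from zero so weights stay finite.
    behavior = random_policy(rng, n_actions, floor=0.5)

    errors = []
    for _ in range(n_trials):
        evaluation = random_policy(rng, n_actions)
        clip = rng.uniform(1.0, 50.0)  # hypothetical hyperparameter range
        actions = rng.choices(range(n_actions), weights=behavior, k=n_rounds)
        rewards = [1.0 if rng.random() < expected_reward[a] else 0.0
                   for a in actions]
        estimate = clipped_ips(
            rewards,
            [evaluation[a] for a in actions],
            [behavior[a] for a in actions],
            clip,
        )
        truth = sum(p * q for p, q in zip(evaluation, expected_reward))
        errors.append((estimate - truth) ** 2)
    return errors


def empirical_cdf(errors, threshold):
    """Fraction of trials whose squared error is at most `threshold`."""
    return sum(e <= threshold for e in errors) / len(errors)


if __name__ == "__main__":
    errs = robustness_errors()
    for z in (1e-4, 1e-3, 1e-2):
        print(f"share of trials with squared error <= {z:g}: "
              f"{empirical_cdf(errs, z):.2f}")
```

Looking at the whole distribution of errors, rather than the error under one tuned setting, shows how often an estimator fails badly when its hyperparameters or the evaluation policy change. The ground truth is available here only because the data are synthetic.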