Off-policy evaluation (OPE) aims to accurately evaluate the performance of counterfactual policies using only offline logged data. Although many estimators have been developed, no single estimator dominates the others, because an estimator's accuracy can vary greatly with the characteristics of a given OPE task, such as the evaluation policy, the number of actions, and the noise level. Data-driven estimator selection is therefore becoming increasingly important and can have a significant impact on the accuracy of OPE. However, identifying the most accurate estimator using only the logged data is challenging because the ground-truth estimation accuracy of estimators is generally unavailable. This paper studies this challenging problem of estimator selection for OPE for the first time. In particular, we enable estimator selection that is adaptive to a given OPE task by appropriately subsampling the available logged data and constructing pseudo policies useful for the underlying estimator selection task. Comprehensive experiments on both synthetic and real-world company data demonstrate that the proposed procedure substantially improves estimator selection compared to a non-adaptive heuristic.
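To make the idea of selecting an estimator through subsampling and pseudo policies concrete, the following is a minimal sketch and not necessarily the procedure proposed in the paper. It assumes a context-free bandit with a known behavior policy `pi_b`, and a hypothetical subsampling rule `rho` that sends each logged record to a pseudo-evaluation set with probability `rho[action]`. Under that assumption, the two subsets behave like data from two pseudo policies, the mean reward of the pseudo-evaluation set serves as an on-policy target, and each candidate estimator is scored on the remaining data. The candidate estimators (`ips`, `snips`, `naive`), the function `select_estimator`, and all numeric values are illustrative choices; how `rho` should be chosen so that the pseudo task resembles the real OPE task is not covered here.

```python
import numpy as np

def ips(r, w):
    # Inverse propensity scoring: importance-weighted mean reward.
    return float(np.mean(w * r))

def snips(r, w):
    # Self-normalized IPS: weights are normalized to sum to one.
    return float(np.sum(w * r) / np.sum(w))

def naive(r, w):
    # Ignores the policy shift entirely (deliberately weak baseline).
    return float(np.mean(r))

def select_estimator(actions, rewards, pi_b, rho, estimators, n_splits=50, seed=0):
    """Score estimators on pseudo OPE tasks built by subsampling one log.

    rho[a] is the (assumed) probability that a record with action a is
    assigned to the pseudo-evaluation set.
    """
    rng = np.random.default_rng(seed)
    # Action distributions induced in each subset by the subsampling rule.
    pi_e_pseudo = pi_b * rho / np.sum(pi_b * rho)
    pi_b_pseudo = pi_b * (1 - rho) / np.sum(pi_b * (1 - rho))
    sq_err = {name: [] for name in estimators}
    for _ in range(n_splits):
        to_eval = rng.random(len(actions)) < rho[actions]
        # On-policy value estimate of the pseudo-evaluation policy.
        target = rewards[to_eval].mean()
        # Remaining records act as the pseudo-behavior log.
        a, r = actions[~to_eval], rewards[~to_eval]
        w = pi_e_pseudo[a] / pi_b_pseudo[a]
        for name, est in estimators.items():
            sq_err[name].append((est(r, w) - target) ** 2)
    # Splits reuse the same log, so these errors are correlated, not i.i.d.
    mse = {name: float(np.mean(v)) for name, v in sq_err.items()}
    return min(mse, key=mse.get), mse

# Synthetic logged data (all values are illustrative assumptions).
data_rng = np.random.default_rng(42)
pi_b = np.array([0.5, 0.3, 0.2])
true_mean = np.array([0.2, 0.5, 0.8])
actions = data_rng.choice(3, size=20000, p=pi_b)
rewards = data_rng.binomial(1, true_mean[actions]).astype(float)
rho = np.array([0.2, 0.5, 0.8])

candidates = {"ips": ips, "snips": snips, "naive": naive}
best, mse = select_estimator(actions, rewards, pi_b, rho, candidates)
print(best, mse)
```

In this toy setting the naive average is biased because the two pseudo policies differ, so the selection should favor one of the importance-weighted estimators. The estimated errors are only a proxy for accuracy on the real task.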