We consider evaluating and training a new policy for the evaluation data using historical data obtained from a different policy. The goal of off-policy evaluation (OPE) is to estimate the expected reward of a new policy over the evaluation data, and that of off-policy learning (OPL) is to find a new policy that maximizes the expected reward over the evaluation data. Although standard OPE and OPL assume that the covariate distribution is the same between the historical and evaluation data, a covariate shift often arises in practice, i.e., the covariate distribution of the historical data differs from that of the evaluation data. In this paper, we derive the efficiency bound of OPE under a covariate shift. We then propose doubly robust and efficient estimators for OPE and OPL under a covariate shift by using an estimator of the density ratio between the distributions of the historical and evaluation data. We also discuss other possible estimators and compare their theoretical properties. Finally, we confirm the effectiveness of the proposed estimators through experiments.
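To fix ideas, a doubly robust estimator of this type can be sketched as follows; the notation ($\hat{r}$, $\pi_e$, $\pi_b$, $\hat{f}$, sample sizes $n_{\mathrm{hst}}$, $n_{\mathrm{evl}}$) is illustrative and not taken from the paper. Writing $\hat{r}(x)$ for an estimate of the density ratio $p_{\mathrm{evl}}(x)/p_{\mathrm{hst}}(x)$, $\pi_e$ for the new (evaluation) policy, $\pi_b$ for the behavior policy that generated the historical data, and $\hat{f}(a, x)$ for an estimated reward model, a doubly robust estimator of the policy value under a covariate shift typically takes the form
\[
\hat{R}(\pi_e)
= \frac{1}{n_{\mathrm{hst}}} \sum_{i=1}^{n_{\mathrm{hst}}}
\hat{r}(x_i)\,\frac{\pi_e(a_i \mid x_i)}{\pi_b(a_i \mid x_i)}
\bigl(y_i - \hat{f}(a_i, x_i)\bigr)
\;+\;
\frac{1}{n_{\mathrm{evl}}} \sum_{j=1}^{n_{\mathrm{evl}}}
\sum_{a} \pi_e\bigl(a \mid x_j^{\mathrm{evl}}\bigr)\,\hat{f}\bigl(a, x_j^{\mathrm{evl}}\bigr),
\]
where the first term corrects for both the covariate shift (through $\hat{r}$) and the policy shift (through the importance weight) on the historical sample, and the second term averages the reward model over the evaluation covariates. The double robustness refers to the estimator remaining consistent when either the reward model or the weighting components are correctly specified.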