Off-policy evaluation (OPE) is a method for estimating the return of a target policy from pre-collected observational data generated by a potentially different behavior policy. In some cases, unmeasured variables may confound the action-reward or action-next-state relationships, rendering many existing OPE approaches ineffective. This paper develops an instrumental variable (IV)-based method for consistent OPE in confounded Markov decision processes (MDPs). We show that, as in single-stage decision making, an IV allows us to correctly identify the target policy's value in infinite-horizon settings as well. Furthermore, we propose an efficient and robust value estimator and illustrate its effectiveness through extensive simulations and analysis of real data from a world-leading short-video platform.