This paper is concerned with constructing a confidence interval for a target policy's value offline, based on pre-collected observational data, in infinite-horizon settings. Most existing works assume that no unmeasured variables confound the observed actions. This assumption, however, is likely to be violated in real applications such as healthcare and the technology industry. In this paper, we show that with auxiliary variables that mediate the effect of actions on the system dynamics, the target policy's value is identifiable in a confounded Markov decision process. Based on this result, we develop an efficient off-policy value estimator that is robust to potential model misspecification and provides rigorous uncertainty quantification. Our method is justified by theoretical results as well as by simulated and real datasets obtained from ridesharing companies. A Python implementation of the proposed procedure is available at https://github.com/Mamba413/cope.
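To make the identification claim concrete, the sketch below gives a front-door-type adjustment of the kind that mediator variables enable. The notation (state $S_t$, action $A_t$, mediator $M_t$, unmeasured confounder $U_t$) and the stated conditions are illustrative assumptions and may differ from the paper's formal result.

```latex
% Hedged sketch: one-step identification via a front-door-type adjustment.
% Assumed conditions (illustrative): the mediator M_t blocks every causal
% path from the action A_t to the reward R_t and the next state S_{t+1},
% and the unmeasured confounder U_t affects A_t but is independent of M_t
% given (S_t, A_t).
\[
  p\bigl(s' \mid s, \mathrm{do}(a)\bigr)
  \;=\; \sum_{m} p(m \mid s, a)
        \sum_{a'} p(a' \mid s)\, p\bigl(s' \mid s, a', m\bigr).
\]
% Every conditional density on the right-hand side involves only observed
% variables, so the transition kernel under an arbitrary target policy,
% and hence that policy's value, can be estimated from the logged data.
```

Under these assumptions, iterating the adjusted one-step kernel recovers the target policy's long-run value despite the confounded action assignment; the paper's estimator and confidence interval build on an identification result of this form.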