This paper is concerned with constructing a confidence interval for a target policy's value offline, based on pre-collected observational data, in infinite-horizon settings. Most existing works assume that no unmeasured variables confound the observed actions. This assumption, however, is likely to be violated in real applications such as healthcare and the technology industry. In this paper, we show that, given auxiliary variables that mediate the effect of actions on the system dynamics, the target policy's value is identifiable in a confounded Markov decision process. Based on this result, we develop an efficient off-policy value estimator that is robust to potential model misspecification and provide rigorous uncertainty quantification. Our method is justified by theoretical results and by simulated and real datasets obtained from ridesharing companies.
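To illustrate why mediators can restore identifiability under unmeasured confounding, the following is a minimal sketch of the classical front-door adjustment in a single-stage (one-step) setting. It is not the estimator developed in the paper, and the abstract does not state that the Markov decision process result takes exactly this form. All probability tables and the names `front_door`, `naive`, and `truth` are hypothetical and chosen only for illustration. The unobserved confounder U affects both the action A and the reward R, while the mediator M depends on A alone.

```python
import numpy as np

# Hypothetical single-stage model (all numbers are illustrative assumptions).
p_u = np.array([0.5, 0.5])                         # P(U=u); U is unobserved
p_a_given_u = np.array([[0.8, 0.2], [0.3, 0.7]])   # P(A=a | U=u), indexed [u, a]
p_m_given_a = np.array([[0.9, 0.1], [0.2, 0.8]])   # P(M=m | A=a), indexed [a, m]
r_um = np.array([[0.0, 1.0], [2.0, 4.0]])          # E[R | U=u, M=m], indexed [u, m]

# Joint distribution over (u, a, m); M depends on U only through A.
joint = p_u[:, None, None] * p_a_given_u[:, :, None] * p_m_given_a[None, :, :]

# Quantities available from observational data (U marginalized out).
p_am = joint.sum(axis=0)                               # P(A=a, M=m)
p_a = p_am.sum(axis=1)                                 # P(A=a)
p_m_a = p_am / p_a[:, None]                            # P(M=m | A=a)
r_am = (joint * r_um[:, None, :]).sum(axis=0) / p_am   # E[R | A=a, M=m]


def front_door(a):
    """Front-door formula: sum_m P(m|a) * sum_a' P(a') * E[R | a', m]."""
    inner = p_a @ r_am                 # for each m: sum_a' P(a') E[R | a', m]
    return float(p_m_a[a] @ inner)


def naive(a):
    """Confounded estimate E[R | A=a], which ignores U."""
    return float(p_m_a[a] @ r_am[a])


def truth(a):
    """Interventional value E[R | do(A=a)], computed using the hidden U."""
    return float(p_u @ (r_um @ p_m_given_a[a]))


for a in (0, 1):
    print(a, naive(a), front_door(a), truth(a))
```

In this toy model the front-door quantity uses only observable distributions yet matches the interventional value, whereas the naive conditional mean does not. Extending the idea to infinite-horizon dynamics, and attaching confidence intervals to the resulting estimate, is the subject of the paper itself.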