In applications of offline reinforcement learning to observational data, such as in healthcare or education, a general concern is that observed actions may be affected by unobserved factors, inducing confounding and biasing estimates derived under the assumption of a perfect Markov decision process (MDP) model. Here we tackle this by considering off-policy evaluation in a partially observed MDP (POMDP). Specifically, we consider estimating the value of a given target policy in a POMDP from trajectories with only partial state observations, generated by a different and unknown behavior policy that may depend on the unobserved state. We address two questions: what conditions allow us to identify the target policy value from the observed data and, given identification, how best to estimate it. To answer these, we extend the framework of proximal causal inference to our POMDP setting, providing a variety of settings in which identification is made possible by the existence of so-called bridge functions. We then show how to construct semiparametrically efficient estimators in these settings. We term the resulting framework proximal reinforcement learning (PRL). We demonstrate the benefits of PRL in an extensive simulation study and on the problem of sepsis management.
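For orientation, the following is a minimal sketch of the single-step proximal causal inference idea that the bridge-function conditions mentioned above generalize to the sequential POMDP setting; the symbols $U, Z, W, A, Y, h$ are illustrative notation (covariates omitted), not the paper's own, and the completeness conditions required for this identification are only alluded to here.

```latex
% Illustrative single-step proximal identification sketch (assumed notation):
%   U: unobserved confounder; A: action; Y: outcome;
%   Z, W: action-side and outcome-side proxies for U.
% An outcome bridge function h solves the conditional moment restriction
% below, and, under suitable completeness conditions, the value of playing
% action a is identified by averaging h over the outcome proxy.
\begin{align}
  \mathbb{E}\bigl[\, Y - h(W, A) \mid Z, A \,\bigr] &= 0, \\
  \mathbb{E}\bigl[\, Y(a) \,\bigr] &= \mathbb{E}\bigl[\, h(W, a) \,\bigr].
\end{align}
```

In the POMDP setting considered in the paper, past and current partial observations play roles analogous to the proxies $Z$ and $W$, and the bridge functions are defined across time steps rather than for a single action.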