In applications of offline reinforcement learning to observational data, such as in healthcare or education, a general concern is that observed actions might be affected by unobserved factors, inducing confounding and biasing estimates derived under the assumption of a perfect Markov decision process (MDP) model. Here we tackle this by considering off-policy evaluation in a partially observed MDP (POMDP). Specifically, we consider estimating the value of a given target policy in a POMDP given trajectories with only partial state observations generated by a different and unknown policy that may depend on the unobserved state. We address two questions: what conditions allow us to identify the target policy value from the observed data and, given identification, how best to estimate it. To answer these, we extend the framework of proximal causal inference to our POMDP setting, providing a variety of settings where identification is made possible by the existence of so-called bridge functions. We then show how to construct semiparametrically efficient estimators in these settings. We term the resulting framework proximal reinforcement learning (PRL). We demonstrate the benefits of PRL in an extensive simulation study.