We study offline reinforcement learning (RL) in the face of unmeasured confounders. Due to the lack of online interaction with the environment, offline RL faces two significant challenges: (i) the agent may be confounded by unobserved state variables; (ii) the offline data collected a priori may not provide sufficient coverage of the environment. To tackle these challenges, we study policy learning in confounded MDPs with the aid of instrumental variables. Specifically, we first establish value function (VF)-based and marginalized importance sampling (MIS)-based identification results for the expected total reward in confounded MDPs. Then, by leveraging pessimism and our identification results, we propose various policy learning methods with finite-sample suboptimality guarantees for finding the optimal in-class policy under minimal data coverage and modeling assumptions. Lastly, our extensive theoretical investigations and a numerical study motivated by kidney transplantation demonstrate the promising performance of the proposed methods.