Efficiently learning a causal model of the environment is a key challenge for model-based RL agents operating in POMDPs. We consider here a scenario where the learning agent has the ability to collect online experiences through direct interactions with the environment (interventional data), but also has access to a large collection of offline experiences, obtained by observing another agent interacting with the environment (observational data). A key ingredient that makes this situation non-trivial is that we allow the observed agent to interact with the environment based on hidden information, which is not observed by the learning agent. We then ask the following questions: can the online and offline experiences be safely combined for learning a causal model? And can we expect the offline experiences to improve the agent's performance? To answer these questions, we import ideas from the well-established causal framework of do-calculus, and we express model-based reinforcement learning as a causal inference problem. Then, we propose a general yet simple methodology for leveraging offline data during learning. In a nutshell, the method relies on learning a latent-based causal transition model that explains both the interventional and observational regimes, and then using the recovered latent variable to infer the standard POMDP transition model via deconfounding. We prove our method is correct and efficient in the sense that it attains better generalization guarantees due to the offline data (in the asymptotic case), and we illustrate its effectiveness empirically on synthetic toy problems. Our contribution aims at bridging the gap between the fields of reinforcement learning and causality.
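The deconfounding step sketched in the abstract can be illustrated on a discrete toy problem. The sketch below is purely illustrative and not the paper's implementation: it assumes a latent-conditional transition model P(s' | s, a, u) and a latent marginal P(u) have already been recovered from the combined data (here both are just randomly generated stand-ins), and applies the backdoor adjustment P(s' | s, do(a)) = Σ_u P(s' | s, a, u) P(u) to obtain the interventional (standard POMDP) transition model.

```python
import numpy as np

# Toy dimensions: states S, actions A, latent confounder values U.
# All quantities below are illustrative stand-ins, not learned models.
rng = np.random.default_rng(0)
n_s, n_a, n_u = 4, 2, 3

# Latent-conditional transition model P(s' | s, a, u), assumed recovered
# by fitting a latent causal model to observational + interventional data.
# Shape: (S, A, U, S'); each last-axis slice is a probability distribution.
p_next = rng.dirichlet(np.ones(n_s), size=(n_s, n_a, n_u))

# Marginal distribution of the latent confounder P(u), assumed recovered.
p_u = rng.dirichlet(np.ones(n_u))

# Backdoor adjustment (deconfounding):
#   P(s' | s, do(a)) = sum_u P(s' | s, a, u) * P(u)
p_do = np.einsum('iauj,u->iaj', p_next, p_u)

# The result is a valid interventional transition model: for every (s, a),
# the distribution over next states s' sums to one.
assert np.allclose(p_do.sum(axis=-1), 1.0)
```

The key point the adjustment captures is that naively pooling offline data would bake the observed agent's hidden-information policy into the transition estimates; marginalizing over the recovered latent removes that bias.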