Learning from Observations (LfO) is a practical reinforcement learning scenario from which many applications can benefit through the reuse of incomplete resources. Compared to conventional imitation learning (IL), LfO is more challenging because of the lack of expert action guidance. Distribution matching lies at the heart of both conventional IL and LfO. Traditional distribution matching approaches are sample-costly, as they depend on on-policy transitions for policy learning. Towards sample efficiency, some off-policy solutions have been proposed, which, however, either lack comprehensive theoretical justification or depend on the guidance of expert actions. In this work, we propose a sample-efficient LfO approach that enables off-policy optimization in a principled manner. To further accelerate the learning procedure, we regulate the policy update with an inverse action model, which assists distribution matching from the perspective of mode covering. Extensive empirical results on challenging locomotion tasks indicate that our approach is comparable with the state of the art in terms of both sample efficiency and asymptotic performance.