Offline imitation learning (IL) is a powerful method for solving decision-making problems from expert demonstrations without reward labels. Existing offline IL methods suffer from severe performance degradation under limited expert data. Incorporating a learned dynamics model can potentially improve the state-action space coverage of the expert data; however, it also faces challenging issues such as model approximation/generalization errors and the suboptimality of rollout data. In this paper, we propose the Discriminator-guided Model-based offline Imitation Learning (DMIL) framework, which introduces a discriminator to simultaneously distinguish the dynamics correctness and suboptimality of model rollout data against real expert demonstrations. DMIL adopts a novel cooperative-yet-adversarial learning strategy, in which the discriminator guides and couples the learning processes of the policy and the dynamics model, resulting in improved model performance and robustness. Our framework can also be extended to the case where demonstrations contain a large proportion of suboptimal data. Experimental results show that DMIL and its extension achieve superior performance and robustness compared to state-of-the-art offline IL methods on small datasets.
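To make the abstract's central mechanism concrete, the following is a minimal sketch of what a transition-level discriminator of this kind could look like. It is an illustrative assumption, not the authors' implementation: the class and function names (`Discriminator`, `discriminator_loss`, `policy_and_model_loss`) and the exact loss forms are hypothetical, and only the high-level structure — a classifier scoring (s, a, s') transitions as expert-like versus rollout-like, whose signal is fed back to both the policy and the dynamics model — follows the abstract's description.

```python
# Hypothetical sketch of a discriminator-guided model-based offline IL step.
# All names and loss forms are illustrative assumptions, not the DMIL paper's code.
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Classifies (s, a, s') transitions: real expert data vs. model rollouts."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # logit: high = "expert-like" transition
        )

    def forward(self, s, a, s_next):
        return self.net(torch.cat([s, a, s_next], dim=-1))

def discriminator_loss(disc, expert_batch, rollout_batch):
    """Binary cross-entropy: expert transitions -> 1, model rollouts -> 0.
    A rollout transition is pushed toward 0 whether its dynamics are wrong
    or its action is suboptimal, so one classifier covers both failure modes."""
    bce = nn.BCEWithLogitsLoss()
    exp_logits = disc(*expert_batch)
    roll_logits = disc(*rollout_batch)
    return (bce(exp_logits, torch.ones_like(exp_logits))
            + bce(roll_logits, torch.zeros_like(roll_logits)))

def policy_and_model_loss(disc, rollout_batch, bc_loss, model_loss):
    """Illustrative cooperative-yet-adversarial coupling: the policy and the
    dynamics model are jointly rewarded for producing rollouts the
    discriminator scores as expert-like, on top of their standard behavior
    cloning and maximum-likelihood objectives."""
    fooling_term = -disc(*rollout_batch).mean()  # adversarial signal
    return bc_loss + model_loss + fooling_term
```

In a full training loop, updates to the discriminator would alternate with updates to the policy and dynamics model, so that the discriminator's judgment both adversarially pressures and cooperatively guides the other two components, consistent with the coupling strategy the abstract describes.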