Offline imitation learning (IL) is a powerful method for solving decision-making problems from expert demonstrations without reward labels. Existing offline IL methods suffer from severe performance degradation under limited expert data due to covariate shift. Incorporating a learned dynamics model can potentially improve the state-action space coverage of expert data; however, it also faces challenging issues such as model approximation/generalization errors and the suboptimality of rollout data. In this paper, we propose the Discriminator-guided Model-based offline Imitation Learning (DMIL) framework, which introduces a discriminator to simultaneously distinguish the dynamics correctness and suboptimality of model rollout data against real expert demonstrations. DMIL adopts a novel cooperative-yet-adversarial learning strategy, which uses the discriminator to guide and couple the learning of the policy and dynamics model, resulting in improved model performance and robustness. Our framework can also be extended to the case where demonstrations contain a large proportion of suboptimal data. Experimental results show that DMIL and its extension achieve superior performance and robustness compared to state-of-the-art offline IL methods on small datasets.
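To make the described training scheme concrete, below is a minimal sketch of a discriminator-guided loop in the spirit of the abstract: a discriminator scores (s, a, s') transitions to separate expert data from model rollouts, while the policy and dynamics model are trained cooperatively to fool it on top of a behavior-cloning objective. All network architectures, dimensions, loss weights, and the `sample_expert` loader are hypothetical placeholders, not the paper's actual implementation.

```python
# Hedged sketch of a discriminator-guided model-based IL loop (not DMIL's
# exact algorithm); all sizes and hyperparameters are illustrative.
import torch
import torch.nn as nn

S_DIM, A_DIM = 4, 2  # hypothetical state/action dimensions

policy = nn.Sequential(nn.Linear(S_DIM, 64), nn.ReLU(), nn.Linear(64, A_DIM))
dynamics = nn.Sequential(nn.Linear(S_DIM + A_DIM, 64), nn.ReLU(), nn.Linear(64, S_DIM))
# Discriminator scores (s, a, s') transitions: expert data vs. model rollouts.
disc = nn.Sequential(nn.Linear(2 * S_DIM + A_DIM, 64), nn.ReLU(), nn.Linear(64, 1))

opt_d = torch.optim.Adam(disc.parameters(), lr=3e-4)
opt_pm = torch.optim.Adam(list(policy.parameters()) + list(dynamics.parameters()), lr=3e-4)
bce = nn.BCEWithLogitsLoss()

def sample_expert(batch=64):
    # Placeholder for a real offline expert-demonstration loader.
    return torch.randn(batch, S_DIM), torch.randn(batch, A_DIM), torch.randn(batch, S_DIM)

for step in range(1000):
    s, a, s_next = sample_expert()

    # Model rollout: the policy proposes actions and the learned dynamics
    # model predicts the next state.
    a_pi = policy(s)
    s_next_hat = dynamics(torch.cat([s, a_pi], dim=-1))
    fake = torch.cat([s, a_pi, s_next_hat], dim=-1)
    real = torch.cat([s, a, s_next], dim=-1)

    # Discriminator step: separate expert transitions from model rollouts,
    # jointly penalizing dynamics errors and suboptimal actions.
    d_loss = (bce(disc(real), torch.ones(real.size(0), 1))
              + bce(disc(fake.detach()), torch.zeros(fake.size(0), 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Policy/model step: behavior cloning plus an adversarial term that
    # pushes rollouts toward transitions the discriminator accepts.
    bc_loss = ((a_pi - a) ** 2).mean()
    adv_loss = bce(disc(fake), torch.ones(fake.size(0), 1))
    opt_pm.zero_grad(); (bc_loss + adv_loss).backward(); opt_pm.step()
```

The design choice illustrated here is the coupling: the same discriminator output serves as a training signal for both the policy and the dynamics model, so improvements in either one tighten the feedback given to the other.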