Recent progress in state-only imitation learning extends the applicability of imitation learning to real-world settings by removing the need to observe expert actions. However, existing solutions only learn to extract a state-to-action mapping policy from the data, without considering how the expert plans toward the target. This hinders the ability to leverage demonstrations and limits the flexibility of the policy. In this paper, we introduce Decoupled Policy Optimization (DePO), which explicitly decouples the policy into a high-level state planner and an inverse dynamics model. With embedded decoupled policy gradient and generative adversarial training, DePO enables knowledge transfer to different action spaces or state transition dynamics, and can generalize the planner to out-of-demonstration state regions. Our in-depth experimental analysis shows the effectiveness of DePO in learning a generalized target-state planner while achieving the best imitation performance. We demonstrate the appealing usage of DePO for transferring across different tasks by pre-training, and its potential for co-training agents with various skills.
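As a minimal sketch of the decoupling described above (the symbols $h_\phi$ for the state planner and $I_\psi$ for the inverse dynamics model are illustrative names introduced here, not necessarily the paper's notation), the policy can be factored into a planner that predicts the next target state and an inverse dynamics model that recovers the action realizing it:
\[
\pi_\theta(a \mid s) \;=\; \mathbb{E}_{s' \sim h_\phi(\cdot \mid s)}\!\left[\, I_\psi(a \mid s, s') \,\right]
\;=\; \int h_\phi(s' \mid s)\, I_\psi(a \mid s, s')\, \mathrm{d}s' .
\]
Under such a factorization, adapting to a new action space or transition dynamics would only require relearning the inverse dynamics component, while the high-level state planner can in principle be reused.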