It has been a long-standing challenge to learn skills for an agent from long-horizon unannotated demonstrations. Existing approaches such as Hierarchical Imitation Learning (HIL) are prone to compounding errors or suboptimal solutions. In this paper, we propose Option-GAIL, a novel method for learning skills over long horizons. The key idea of Option-GAIL is to model the task hierarchy with options and train the policy via generative adversarial optimization. In particular, we propose an Expectation-Maximization (EM)-style algorithm: an E-step that samples the options of the expert conditioned on the currently learned policy, and an M-step that updates the low- and high-level policies of the agent simultaneously to minimize the newly proposed option-occupancy measurement between the expert and the agent. We theoretically prove the convergence of the proposed algorithm. Experiments show that Option-GAIL consistently outperforms its counterparts across a variety of tasks.
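To make the alternating scheme concrete, here is a minimal sketch of the EM-style training loop described above. This is an illustrative skeleton under assumed names, not the authors' implementation: `infer_expert_options`, `update_policies`, and `rollout` are hypothetical placeholders, and their stub bodies stand in for the real components (an option-inference pass over expert trajectories for the E-step, and a GAIL-style adversarial update of the high- and low-level policies for the M-step).

```python
# Hypothetical skeleton of the Option-GAIL EM loop; all names and stub
# bodies are placeholders, not the paper's actual API.
from typing import List, Tuple

State, Action, Option = int, int, int
Trajectory = List[Tuple[State, Action]]

def infer_expert_options(traj: Trajectory, policy) -> List[Option]:
    """E-step: estimate the latent option at each expert step,
    conditioned on the currently learned hierarchical policy.
    (Placeholder: assigns a single option everywhere.)"""
    return [0 for _ in traj]

def update_policies(policy, expert_traj, expert_opts, agent_rollouts):
    """M-step: adversarially update the high-level (option-selection)
    and low-level (action-selection) policies together, so the agent's
    option-occupancy measurement matches the expert's.
    (Placeholder: no-op on the dummy policy.)"""
    return policy

def rollout(policy, horizon: int) -> Trajectory:
    """Collect agent experience under the current hierarchical policy.
    (Placeholder: a fixed dummy trajectory.)"""
    return [(0, 0)] * horizon

expert_traj: Trajectory = [(0, 0), (1, 1), (2, 0)]  # unannotated: options unobserved
policy = object()  # stands in for the hierarchical (option) policy

for em_iteration in range(10):
    expert_opts = infer_expert_options(expert_traj, policy)       # E-step
    agent_rollouts = [rollout(policy, horizon=3)]
    policy = update_policies(policy, expert_traj, expert_opts,
                             agent_rollouts)                      # M-step
```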