Extrapolating beyond-demonstrator (BD) performance through imitation learning (IL) aims to learn from and ultimately outperform the demonstrator. Most existing BDIL algorithms proceed in two stages: a reward function is first inferred from demonstrations, and a policy is then learned via reinforcement learning (RL). Such two-stage BDIL algorithms suffer from high computational complexity, weak robustness, and large performance variation; in particular, a poor reward function derived in the first stage inevitably incurs a severe performance loss in the second stage. In this work, we propose a hybrid adversarial imitation learning (HAIL) algorithm that is one-stage, model-free, curiosity-driven, and trained in a generative-adversarial (GA) fashion. Thanks to the one-stage design, HAIL integrates reward function learning and policy optimization into a single procedure, which brings advantages such as low computational complexity, high robustness, and strong adaptability. More specifically, HAIL simultaneously imitates the demonstrator and explores BD performance by utilizing hybrid rewards. Extensive simulation results confirm that HAIL achieves higher performance than other comparable BDIL algorithms.
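As a minimal illustrative sketch only (the exact form of HAIL's hybrid reward is not specified in this abstract), the hybrid-reward idea can be viewed as combining a GAIL-style discriminator-based imitation reward with an intrinsic curiosity bonus. The reward shapes and the weighting coefficient `beta` below are assumptions, not the paper's definitive formulation.

```python
import numpy as np

def hybrid_reward(d_prob, curiosity_bonus, beta=0.5):
    """Combine an adversarial imitation reward with a curiosity bonus.

    d_prob          : discriminator output D(s, a) in (0, 1), i.e. the estimated
                      probability that the (state, action) pair is demonstrator-like.
    curiosity_bonus : intrinsic exploration reward, e.g. a prediction-error signal.
    beta            : weighting coefficient between imitation and exploration
                      (hypothetical value; would need tuning in practice).
    """
    # GAIL-style imitation reward: large when the discriminator believes
    # the transition comes from the demonstrator.
    imitation_reward = -np.log(1.0 - d_prob + 1e-8)
    # Hybrid reward: imitate the demonstrator while still rewarding exploration
    # of novel states, which is what enables beyond-demonstrator performance.
    return imitation_reward + beta * curiosity_bonus

# Example: a demonstrator-like transition plus a moderate curiosity signal.
print(hybrid_reward(d_prob=0.9, curiosity_bonus=0.3))
```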