In this paper, we investigate the problem of inverse reinforcement learning (IRL), in particular beyond-demonstrator (BD) IRL. BD-IRL aims not only to imitate the expert policy but also to extrapolate a policy that surpasses the demonstrator from a finite set of expert demonstrations. Most existing BD-IRL algorithms are two-stage: they first infer a reward function and then learn a policy via reinforcement learning (RL). Because these two procedures are separate, two-stage algorithms incur high computational complexity and lack robustness. To overcome these flaws, we propose a BD-IRL framework entitled hybrid adversarial inverse reinforcement learning (HAIRL), which integrates imitation and exploration into a single procedure. Simulation results show that HAIRL is more efficient and robust than other state-of-the-art (SOTA) algorithms.