Deep reinforcement learning (DRL) has achieved great success in many simulated tasks, but its sample inefficiency makes applying traditional DRL methods to real-world robots a great challenge. Generative Adversarial Imitation Learning (GAIL), a general model-free imitation learning method, allows robots to learn policies directly from expert trajectories in large environments. However, GAIL shares a limitation common to imitation learning methods: it can seldom surpass the performance of the demonstrations. In this paper, to address this limitation of GAIL, we propose GAN-Based Interactive Reinforcement Learning (GAIRL), which learns from demonstrations and human evaluative feedback by combining the advantages of GAIL and interactive reinforcement learning. We tested the proposed method on six physics-based control tasks, ranging from simple low-dimensional control tasks (Cart Pole and Mountain Car) to difficult high-dimensional tasks (Inverted Double Pendulum, Lunar Lander, Hopper, and HalfCheetah). Our results suggest that, with both optimal and suboptimal demonstrations, a GAIRL agent can always learn a more stable policy with optimal or near-optimal performance, whereas the performance of a GAIL agent is upper bounded by the performance of the demonstrations, or even falls below it. In addition, our results indicate that GAIRL is superior to GAIL because of the complementary effect of demonstrations and human evaluative feedback.