Imitation learning is a promising class of policy learning algorithms that is free from many practical issues of reinforcement learning, such as reward design and the difficulty of exploration. However, current imitation learning algorithms struggle to achieve both high performance and high in-environment sample efficiency simultaneously. Behavioral Cloning (BC) does not need in-environment interactions, but it suffers from the covariate shift problem, which harms its performance. Adversarial Imitation Learning (AIL) turns imitation learning into a distribution matching problem. It can achieve better performance on some tasks, but it requires a large number of in-environment interactions. Inspired by the recent success of EfficientZero in RL, we propose EfficientImitate (EI), a planning-based imitation learning method that achieves high in-environment sample efficiency and high performance simultaneously. Our algorithmic contribution in this paper is two-fold. First, we extend AIL to MCTS-based RL. Second, we show that the two seemingly incompatible classes of imitation algorithms (BC and AIL) can be naturally unified under our framework, enjoying the benefits of both. We benchmark our method not only on the state-based DeepMind Control Suite, but also on the image-based version, which many previous works find highly challenging. Experimental results show that EI achieves state-of-the-art performance and sample efficiency. EI shows over 4x gains in performance in the limited-sample setting on both state-based and image-based tasks, and it can solve challenging problems like Humanoid, where previous methods fail with a small number of interactions. Our code is available at https://github.com/zhaohengyin/EfficientImitate.