Deep reinforcement learning (DRL) has proven effective for several complex decision-making applications such as autonomous driving and robotics. However, DRL is notoriously limited by its high sample complexity and its lack of stability. Prior knowledge, e.g., in the form of expert demonstrations, is often available but challenging to leverage to mitigate these issues. In this paper, we propose General Reinforced Imitation (GRI), a novel method which combines the benefits of exploration and expert data and is straightforward to implement over any off-policy RL algorithm. We make one simplifying hypothesis: expert demonstrations can be seen as perfect data whose underlying policy gets a constant high reward. Based on this assumption, GRI introduces the notion of an offline demonstration agent, which sends expert data that are processed concurrently with, and indistinguishably from, the experiences coming from the online RL exploration agent. We show that our approach enables major improvements on vision-based autonomous driving in urban environments. We further validate the GRI method on MuJoCo continuous control tasks with different off-policy RL algorithms. Our method ranked first on the CARLA Leaderboard and outperforms World on Rails, the previous state-of-the-art, by 17%.
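To illustrate the core data-mixing idea described above, here is a minimal sketch of a replay buffer that relabels expert transitions with a constant high reward and serves them alongside online exploration data. All names (`MixedReplayBuffer`, `R_DEMO`, `p_demo`) and the buffer design are illustrative assumptions, not the paper's actual implementation.

```python
import random
from collections import deque

R_DEMO = 1.0  # assumed constant high reward assigned to every expert transition

class MixedReplayBuffer:
    """Hypothetical replay buffer mixing online exploration data with expert demos.

    Expert transitions are relabeled with the constant reward R_DEMO and sampled
    indistinguishably from agent transitions, so any off-policy RL algorithm
    (e.g., DQN or SAC) can train on the union of both streams.
    """

    def __init__(self, capacity=100_000, p_demo=0.25):
        self.online = deque(maxlen=capacity)  # experiences from the exploration agent
        self.demo = []                        # fixed offline expert dataset
        self.p_demo = p_demo                  # fraction of each batch drawn from demos

    def add_online(self, s, a, r, s_next, done):
        self.online.append((s, a, r, s_next, done))

    def add_demo(self, s, a, s_next, done):
        # Simplifying hypothesis: the expert policy gets a constant high reward.
        self.demo.append((s, a, R_DEMO, s_next, done))

    def sample(self, batch_size):
        n_demo = int(self.p_demo * batch_size)
        batch = random.sample(self.demo, min(n_demo, len(self.demo)))
        batch += random.sample(list(self.online),
                               min(batch_size - len(batch), len(self.online)))
        random.shuffle(batch)  # the downstream learner cannot tell the two sources apart
        return batch
```

Under this sketch, the off-policy learner simply calls `sample(batch_size)` each update step; from its perspective, demonstration and exploration transitions are one homogeneous stream.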