Model-free reinforcement learning (RL) requires a large number of trials to learn a good policy, especially in environments with sparse rewards. We explore a method to improve sample efficiency when we have access to demonstrations. Our approach, Backplay, uses a single demonstration to construct a curriculum for a given task. Rather than starting each training episode in the environment's fixed initial state, we start the agent near the end of the demonstration and move the starting point backwards over the course of training until we reach the initial state. Our contributions are that we analytically characterize the types of environments where Backplay can improve training speed, demonstrate the effectiveness of Backplay in both large grid worlds and a complex four-player zero-sum game (Pommerman), and show that Backplay compares favorably to other competitive methods known to improve sample efficiency, including reward shaping, behavioral cloning, and reverse curriculum generation.
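A minimal sketch of the curriculum idea described above, assuming the environment can be reset to an arbitrary state (the `env.reset_to` call and `demo_states` list are hypothetical placeholders, not part of the original work's code):

```python
import random

def backplay_start_state(demo_states, epoch, total_epochs, window=4):
    """Sample a training start state from a window of the demonstration
    that slides from the end of the trajectory back to its beginning."""
    T = len(demo_states) - 1
    # Fraction of training completed determines how far back the window sits.
    progress = min(epoch / total_epochs, 1.0)
    # Early in training the window sits near the final demonstration state
    # (index T); late in training it reaches the true initial state (index 0).
    center = int(round(T * (1.0 - progress)))
    lo = max(0, center - window)
    hi = min(T, center + window)
    return demo_states[random.randint(lo, hi)]

# Hypothetical usage in a training loop:
# for epoch in range(total_epochs):
#     start = backplay_start_state(demo_states, epoch, total_epochs)
#     state = env.reset_to(start)   # assumes resettable environment
#     ... run the RL algorithm from `state` ...
```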