Reinforcement learning with sparse rewards is challenging because an agent rarely obtains non-zero rewards, so gradient-based optimization of parameterized policies can be incremental and slow. Recent work demonstrated that using a memory buffer of previous successful trajectories can result in more effective policies. However, existing methods may overly exploit past successful experiences, which can encourage the agent to adopt sub-optimal and myopic behaviors. In this work, instead of focusing on good experiences with limited diversity, we propose to learn a trajectory-conditioned policy to follow and expand diverse past trajectories from a memory buffer. Our method allows the agent to reach diverse regions in the state space and improve upon past trajectories to reach new states. We empirically show that our approach significantly outperforms count-based exploration methods (a parametric approach) and self-imitation learning (a parametric approach with non-parametric memory) on various complex tasks with local optima. In particular, without using expert demonstrations or resetting to arbitrary states, we achieve state-of-the-art scores within five billion frames on challenging Atari games such as Montezuma's Revenge and Pitfall.
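To make the core idea concrete, the sketch below illustrates one way a diverse trajectory memory could be maintained and queried: for each distinct (discretized) state reached so far, keep the best trajectory prefix that reaches it, and sample demonstrations uniformly over those states rather than over returns. This is a minimal, hypothetical illustration; the names `TrajectoryBuffer`, `state_key`, and `sample_demonstration` are assumptions for exposition, not the paper's implementation.

```python
import random


class TrajectoryBuffer:
    """Keeps, for each distinct state key, the best trajectory prefix reaching it."""

    def __init__(self):
        # state_key -> (cumulative_reward, trajectory_prefix)
        self.best = {}

    def add(self, trajectory):
        """trajectory: list of (state_key, action, reward) tuples from one episode."""
        cum_reward = 0.0
        for t, (state_key, _action, reward) in enumerate(trajectory):
            cum_reward += reward
            stored = self.best.get(state_key)
            # Keep the highest-return prefix that reaches this state.
            if stored is None or cum_reward > stored[0]:
                self.best[state_key] = (cum_reward, trajectory[: t + 1])

    def sample_demonstration(self):
        # Sampling uniformly over distinct reached states (rather than only
        # high-return episodes) encourages revisiting and extending diverse
        # regions of the state space.
        return random.choice(list(self.best.values()))[1]
```

A trajectory-conditioned policy would then take both the current observation and a sampled demonstration as input, and be trained to follow the demonstration and continue past its endpoint to reach new states; the training objective itself is beyond this sketch.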