Episodic memory-based methods can rapidly latch onto past successful strategies via a non-parametric memory and improve the sample efficiency of traditional reinforcement learning. However, little effort has been devoted to the continuous domain, where a state is never visited twice and previous episodic methods fail to aggregate experience efficiently across trajectories. To address this problem, we propose Generalizable Episodic Memory (GEM), which effectively organizes the state-action values of episodic memory in a generalizable manner and supports implicit planning on memorized trajectories. GEM utilizes a double estimator to reduce the overestimation bias induced by value propagation in the planning process. Empirical evaluation shows that our method significantly outperforms existing trajectory-based methods on various MuJoCo continuous control tasks. To further demonstrate its general applicability, we evaluate our method on Atari games with discrete action spaces, where it also shows significant improvement over baseline algorithms.
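To make the core mechanism concrete, the sketch below illustrates double-estimator value propagation along a memorized trajectory: each estimator recursively compares its bootstrapped one-step return against its own Q estimate, while the final target mixes the two (one estimator selects, the other evaluates) to damp the overestimation that max-based backups induce. This is a minimal illustration under our own assumptions, not the paper's exact update; the function `episodic_backup` and its arguments are hypothetical names.

```python
import numpy as np

def episodic_backup(rewards, q1, q2, gamma=0.99):
    """Hypothetical sketch of a double-estimator backup along one
    memorized trajectory (not GEM's exact formulation).

    rewards: per-step rewards r_0..r_{T-1} from the stored trajectory
    q1, q2:  the two estimators' Q(s_t, a_t) predictions along it
    Returns backup targets V_0..V_{T-1}.
    """
    T = len(rewards)
    v1 = np.zeros(T)      # recursive backup under estimator 1
    v2 = np.zeros(T)      # recursive backup under estimator 2
    target = np.zeros(T)  # double-estimator training target
    next_v1 = next_v2 = 0.0
    for t in reversed(range(T)):
        # Each estimator implicitly plans: either bootstrap from its own
        # Q estimate or follow the memorized return, whichever is larger.
        v1[t] = max(q1[t], rewards[t] + gamma * next_v1)
        v2[t] = max(q2[t], rewards[t] + gamma * next_v2)
        # Double estimation: estimator 1 makes the bootstrap-vs-memory
        # choice, estimator 2 supplies the value for that choice, which
        # reduces the upward bias of taking a max under noisy estimates.
        take_memory = rewards[t] + gamma * next_v1 >= q1[t]
        target[t] = (rewards[t] + gamma * next_v2) if take_memory else q2[t]
        next_v1, next_v2 = v1[t], v2[t]
    return target

# Toy usage with random Q predictions for a length-3 trajectory.
rng = np.random.default_rng(0)
print(episodic_backup(rewards=[1.0, 0.0, 2.0],
                      q1=rng.normal(size=3), q2=rng.normal(size=3)))
```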