Episodic memory-based methods can rapidly latch onto past successful strategies through a non-parametric memory and thereby improve the sample efficiency of traditional reinforcement learning. However, little effort has been devoted to the continuous domain, where a state is never visited twice and previous episodic methods fail to aggregate experience across trajectories efficiently. To address this problem, we propose Generalizable Episodic Memory (GEM), which effectively organizes the state-action values of episodic memory in a generalizable manner and supports implicit planning on memorized trajectories. GEM utilizes a double estimator to reduce the overestimation bias induced by value propagation in the planning process. Empirical evaluation shows that our method significantly outperforms existing trajectory-based methods on various MuJoCo continuous control tasks. To further demonstrate its general applicability, we evaluate our method on Atari games with discrete action spaces, where it also shows significant improvement over baseline algorithms.
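To illustrate the general idea of propagating values backward along a memorized trajectory while curbing overestimation with a double estimator, the following is a minimal sketch only, not the paper's exact algorithm; the function `backup_trajectory` and the value functions `v1`, `v2` are hypothetical names introduced for illustration.

```python
import numpy as np

def backup_trajectory(rewards, next_states, v1, v2, gamma=0.99):
    """Sketch: compute backed-up value targets along one stored trajectory.

    rewards[t], next_states[t] come from a single memorized episode.
    v1, v2 are two independently trained value estimators (hypothetical);
    taking their minimum is a double-estimator-style way to reduce
    overestimation bias when bootstrapping.
    """
    T = len(rewards)
    targets = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        # Conservative one-step bootstrap using the smaller of the two estimates.
        bootstrap = min(v1(next_states[t]), v2(next_states[t]))
        one_step = rewards[t] + gamma * bootstrap
        if t == T - 1:
            running = one_step
        else:
            # Implicit planning: keep the better of bootstrapping now
            # or continuing along the memorized trajectory.
            running = max(one_step, rewards[t] + gamma * running)
        targets[t] = running
    return targets  # e.g., used as regression targets for a Q-network
```

In this sketch, the backward `max` lets successful suffixes of past trajectories dominate the target, while the `min` over two value estimates hedges against the optimistic bias that such maximization would otherwise amplify.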