One approach to meet the challenges of deep lifelong reinforcement learning (LRL) is careful management of the agent's learning experiences, in order to learn (without forgetting) and to build internal meta-models (of the tasks, environments, agents, and world). Generative replay (GR) is a biologically inspired replay mechanism that augments learning experiences with self-labelled examples drawn from an internal generative model that is updated over time. We present a version of GR for LRL that satisfies two desiderata: (a) introspective density modelling of the latent representations of policies learned using deep RL, and (b) model-free end-to-end learning. In this paper, we study three deep learning architectures for model-free GR, starting from a na\"ive GR and adding ingredients to achieve (a) and (b). We evaluate our proposed algorithms on three different scenarios comprising tasks from the Starcraft-2 and Minigrid domains. We report several key findings showing the impact of the design choices on quantitative metrics that include transfer learning, generalization to unseen tasks, fast adaptation after task change, performance relative to a task expert, and catastrophic forgetting. We observe that our GR prevents drift in the features-to-action mapping of a deep RL agent's latent vector space, and we show improvements in established lifelong learning metrics. We find that a small random replay buffer significantly increases the stability of training. Overall, we find that "hidden replay" (a well-known architecture for class-incremental classification) is the most promising approach, pushing the state-of-the-art in GR for LRL, and we observe that the architecture of the sleep model may matter more for performance than the type of replay used. Our experiments required only 6% of training samples to achieve 80-90% of expert performance in most Starcraft-2 scenarios.
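To make the mechanism summarized above concrete, the following is a minimal, illustrative sketch (not the paper's implementation) of generative replay over latent features: a Gaussian density stands in for the introspective model of latent representations, generated latents are self-labelled by the current action head, and the resulting pseudo-examples are mixed with new-task data and a small random replay buffer before retraining. All names (LatentGaussian, train_action_head, label_with_policy) and the toy data are hypothetical.

```python
# Illustrative generative-replay sketch over latent features (assumptions noted above).
import numpy as np

rng = np.random.default_rng(0)

class LatentGaussian:
    """Introspective density model over latent vectors (a stand-in generative model)."""
    def fit(self, z):
        self.mu = z.mean(axis=0)
        self.cov = np.cov(z, rowvar=False) + 1e-4 * np.eye(z.shape[1])
        return self

    def sample(self, n):
        return rng.multivariate_normal(self.mu, self.cov, size=n)

def label_with_policy(w, z):
    """Self-label generated latents with the current linear action head (argmax logits)."""
    return np.argmax(z @ w, axis=1)

def train_action_head(w, z, a, n_actions, lr=0.1, epochs=50):
    """Cross-entropy training of a linear features-to-action mapping."""
    for _ in range(epochs):
        logits = z @ w
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        y = np.eye(n_actions)[a]
        w -= lr * z.T @ (p - y) / len(z)
    return w

# Toy setup: 8-dim latents, 4 actions, two sequential "tasks".
d, n_actions = 8, 4
w = 0.01 * rng.standard_normal((d, n_actions))

# Task 1: latents and actions (in practice, from a deep RL agent's encoder/policy).
z1 = rng.standard_normal((512, d)) + 2.0
a1 = rng.integers(0, n_actions, size=512)
w = train_action_head(w, z1, a1, n_actions)
density = LatentGaussian().fit(z1)          # update the internal generative model
z_buf, a_buf = z1[:32], a1[:32]             # small random replay buffer

# Task 2: mix new data with self-labelled generative replay and the buffer.
z2 = rng.standard_normal((512, d)) - 2.0
a2 = rng.integers(0, n_actions, size=512)
z_gen = density.sample(512)
a_gen = label_with_policy(w, z_gen)         # self-labelled pseudo-examples
z_mix = np.concatenate([z2, z_gen, z_buf])
a_mix = np.concatenate([a2, a_gen, a_buf])
w = train_action_head(w, z_mix, a_mix, n_actions)  # learn task 2 while rehearsing task 1
```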