One approach to meet the challenges of deep lifelong reinforcement learning (LRL) is careful management of the agent's learning experiences, in order to learn (without forgetting) and to build internal meta-models (of the tasks, environments, agents, and world). Generative replay (GR) is a biologically-inspired replay mechanism that augments learning experiences with self-labelled examples drawn from an internal generative model that is updated over time. In this paper, we present a version of GR for LRL that satisfies two desiderata: (a) introspective density modelling of the latent representations of policies learned using deep RL, and (b) model-free end-to-end learning. We study three deep-learning architectures for model-free GR and evaluate the proposed algorithms on three scenarios comprising tasks from the StarCraft2 and Minigrid domains. We report several key findings showing the impact of the design choices on quantitative metrics that include transfer learning, generalization to unseen tasks, fast adaptation after task change, performance comparable to a task expert, and minimizing catastrophic forgetting. We observe that our GR prevents drift in the features-to-action mapping from the latent vector space of a deep actor-critic agent, and we show improvements in established lifelong-learning metrics. We also find that introducing a small random replay buffer, used alongside the generated replay buffer, is needed to significantly increase the stability of training. Overall, we find that "hidden replay" (a well-known architecture for class-incremental classification) is the most promising approach, pushing the state of the art in GR for LRL.
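Since the abstract describes the mechanism only at a high level, the following is a minimal sketch (not the authors' implementation) of how hidden replay might be wired into a deep actor-critic agent: a small VAE models the density of the agent's latent features, pseudo-latents sampled from it are self-labelled by a frozen copy of the policy head, and both are mixed with a small random buffer of real latents during a consolidation step. All names (`LatentVAE`, `consolidate`, `old_head`) and dimensions are illustrative assumptions.

```python
import copy
import random

import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_DIM, CODE_DIM, N_ACTIONS = 64, 16, 8  # assumed sizes for illustration


class LatentVAE(nn.Module):
    """Introspective density model over the policy's latent features."""

    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(LATENT_DIM, 128), nn.ReLU())
        self.mu = nn.Linear(128, CODE_DIM)
        self.logvar = nn.Linear(128, CODE_DIM)
        self.dec = nn.Sequential(nn.Linear(CODE_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, LATENT_DIM))

    def forward(self, z):
        h = self.enc(z)
        mu, logvar = self.mu(h), self.logvar(h)
        code = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        return self.dec(code), mu, logvar

    def sample(self, n):
        return self.dec(torch.randn(n, CODE_DIM))


policy_head = nn.Linear(LATENT_DIM, N_ACTIONS)   # actor head over latent features
old_head = copy.deepcopy(policy_head).eval()     # frozen snapshot from the previous task
for p in old_head.parameters():
    p.requires_grad_(False)

vae = LatentVAE()
real_buffer = []                                 # small random replay buffer of real latents
opt = torch.optim.Adam(list(vae.parameters()) + list(policy_head.parameters()), lr=1e-3)


def consolidate(real_latents, batch_size=32, buffer_cap=1000):
    """One consolidation step mixing real, buffered, and generated latents."""
    # Keep a small random reservoir of real latents (stabilises training).
    for z in real_latents:
        real_buffer.append(z.detach())
        if len(real_buffer) > buffer_cap:
            real_buffer[random.randrange(buffer_cap)] = real_buffer.pop()

    # Generated replay: sample pseudo-latents and self-label them with the old head.
    replayed = vae.sample(batch_size).detach()
    with torch.no_grad():
        pseudo_actions = old_head(replayed).argmax(dim=-1)

    buffered = torch.stack(random.sample(real_buffer, min(batch_size, len(real_buffer))))
    mixed = torch.cat([real_latents, buffered, replayed])

    # (1) Refit the density model on the mixed batch of latents.
    recon, mu, logvar = vae(mixed)
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
    vae_loss = F.mse_loss(recon, mixed) + kl

    # (2) Distil the old features-to-action mapping into the current head,
    #     counteracting drift in the latent-to-action mapping.
    distil_loss = F.cross_entropy(policy_head(replayed), pseudo_actions)

    loss = vae_loss + distil_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


if __name__ == "__main__":
    # Stand-in for latents produced by the agent's feature extractor on new-task data.
    for _ in range(3):
        print(consolidate(torch.randn(32, LATENT_DIM)))
```

In practice the frozen head would be snapshotted at each task boundary and the consolidation loss would be combined with the actor-critic objective on the new task; the sketch isolates only the replay-and-distillation step.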