A key challenge of continual reinforcement learning (CRL) in dynamic environments is to promptly adapt the RL agent's behavior as the environment changes over its lifetime, while minimizing catastrophic forgetting of what has already been learned. To address this challenge, in this article we propose dynamics-adaptive continual RL (DaCoRL). DaCoRL learns a context-conditioned policy using progressive contextualization, which incrementally clusters a stream of stationary tasks in the dynamic environment into a series of contexts and uses an expandable multihead neural network to approximate the policy. Specifically, we define a set of tasks with similar dynamics as an environmental context and formalize context inference as online Bayesian infinite Gaussian mixture clustering on environment features, using online Bayesian inference to estimate the posterior distribution over contexts. Under the assumption of a Chinese restaurant process prior, this technique can accurately classify the current task as a previously seen context or instantiate a new context as needed, without relying on any external indicator to signal environmental changes in advance. Furthermore, we employ an expandable multihead neural network whose output layer is expanded in step with each newly instantiated context, together with a knowledge distillation regularization term that retains performance on learned tasks. As a general framework that can be coupled with various deep RL algorithms, DaCoRL consistently outperforms existing methods in stability, overall performance, and generalization ability, as verified by extensive experiments on several robot navigation and MuJoCo locomotion tasks.
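To make the context-inference step concrete, the following is a minimal sketch of online clustering under a Chinese restaurant process prior. It is an illustration under simplifying assumptions, not the procedure used in DaCoRL: each context is modeled as an isotropic Gaussian with a fixed, known variance, each feature vector is hard-assigned to its most probable context, and the environment features are taken as given. The class name `CRPContextInference` and the parameters `alpha`, `obs_var`, and `prior_var` are hypothetical names introduced here.

```python
import math


def log_gauss(x, mean, var):
    """Log density of an isotropic Gaussian N(mean, var * I) evaluated at x."""
    d = len(x)
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * (d * math.log(2.0 * math.pi * var) + sq_dist / var)


class CRPContextInference:
    """Hard-assignment online clustering with a CRP prior (illustrative only)."""

    def __init__(self, dim, alpha=1.0, obs_var=0.05, prior_var=4.0):
        self.alpha = alpha          # CRP concentration: higher favors new contexts
        self.obs_var = obs_var      # assumed within-context feature variance
        self.prior_var = prior_var  # assumed variance of the prior over context means
        self.prior_mean = [0.0] * dim
        self.counts = []            # number of tasks assigned to each context
        self.means = []             # running mean feature of each context

    def posterior(self, x):
        """Posterior over [existing contexts..., new context] for feature x."""
        # Existing context k: CRP weight n_k times the Gaussian likelihood.
        logs = [math.log(n) + log_gauss(x, m, self.obs_var)
                for n, m in zip(self.counts, self.means)]
        # New context: CRP weight alpha times the prior predictive density.
        logs.append(math.log(self.alpha)
                    + log_gauss(x, self.prior_mean, self.obs_var + self.prior_var))
        top = max(logs)
        weights = [math.exp(v - top) for v in logs]  # subtract max for stability
        total = sum(weights)
        return [w / total for w in weights]

    def assign(self, x):
        """Assign x to its most probable context, creating a new one if needed."""
        x = [float(v) for v in x]
        post = self.posterior(x)
        k = max(range(len(post)), key=post.__getitem__)
        if k == len(self.counts):
            # The "new context" option won: instantiate it.
            self.counts.append(1)
            self.means.append(list(x))
        else:
            # Incremental update of the running mean of context k.
            self.counts[k] += 1
            n = self.counts[k]
            self.means[k] = [m + (xi - m) / n for m, xi in zip(self.means[k], x)]
        return k, post


# Usage: a stream of 2-D task features drawn from two well-separated regimes.
model = CRPContextInference(dim=2)
stream = [[0.0, 0.1], [0.1, 0.0], [3.0, 3.0], [0.05, 0.05], [3.1, 2.9]]
labels = [model.assign(x)[0] for x in stream]
print(labels)  # -> [0, 0, 1, 0, 1]
```

In this sketch no external signal marks the switch at the third task: the new context is created because the prior predictive density explains the feature better than the existing cluster does. In a multihead policy network, each newly returned context index would correspond to adding one output head.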