A central capability of a long-lived reinforcement learning (RL) agent is to incrementally adapt its behavior as its environment changes, and to build incrementally upon previous experience to facilitate future learning in real-world scenarios. In this paper, we propose LifeLong Incremental Reinforcement Learning (LLIRL), a new incremental algorithm for efficient lifelong adaptation to dynamic environments. We develop and maintain a library that contains an infinite mixture of parameterized environment models, which is equivalent to clustering environment parameters in a latent space. The prior distribution over the mixture is formulated as a Chinese restaurant process (CRP), which incrementally instantiates new environment models without requiring any external information to signal environmental changes in advance. During lifelong learning, we employ the expectation maximization (EM) algorithm with online Bayesian inference to update the mixture in a fully incremental manner. In EM, the E-step involves estimating the posterior expectation of environment-to-cluster assignments, while the M-step updates the environment parameters for future learning. This method allows all environment models to be adapted as necessary, with new models instantiated for environmental changes and old models retrieved when previously seen environments are encountered again. Experiments demonstrate that LLIRL outperforms relevant existing methods and enables effective incremental adaptation to various dynamic environments for lifelong learning.
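To make the CRP-plus-EM machinery described above concrete, the sketch below clusters environment-parameter vectors under a CRP prior, using a hard (MAP) assignment step in place of the full posterior expectation and an incremental mean update as the M-step. This is a minimal illustration under simplifying assumptions (isotropic Gaussian clusters with known variances), not the authors' implementation; names such as `EnvironmentLibrary`, `observe`, `alpha`, and `prior_var` are hypothetical.

```python
import numpy as np


def log_gauss(x, mean, var):
    """Log-density of an isotropic Gaussian N(x | mean, var * I)."""
    d = x.size
    return -0.5 * (d * np.log(2.0 * np.pi * var) + np.sum((x - mean) ** 2) / var)


class EnvironmentLibrary:
    """Hypothetical library of environment models with a CRP prior over cluster assignments."""

    def __init__(self, dim, alpha=1.0, obs_var=1.0, prior_var=10.0):
        self.alpha = alpha                  # CRP concentration: propensity to open new clusters
        self.obs_var = obs_var              # assumed within-cluster (observation) variance
        self.prior_var = prior_var          # assumed variance of the base measure over cluster means
        self.prior_mean = np.zeros(dim)     # assumed mean of the base measure
        self.means = []                     # one parameter vector (cluster mean) per environment model
        self.counts = []                    # number of observations assigned to each cluster

    def observe(self, x):
        """E-step: score each cluster (CRP prior x likelihood), including a 'new cluster' option.
        M-step: incrementally update the parameters of the chosen environment model."""
        x = np.asarray(x, dtype=float)
        n = sum(self.counts)
        # Existing cluster k: prior count_k / (n + alpha), likelihood N(x | mean_k, obs_var * I).
        scores = [np.log(c / (n + self.alpha)) + log_gauss(x, m, self.obs_var)
                  for c, m in zip(self.counts, self.means)]
        # New cluster: prior alpha / (n + alpha), predictive N(x | prior_mean, (prior_var + obs_var) * I).
        scores.append(np.log(self.alpha / (n + self.alpha))
                      + log_gauss(x, self.prior_mean, self.prior_var + self.obs_var))
        k = int(np.argmax(scores))          # hard (MAP) assignment, simplifying the posterior expectation
        if k == len(self.means):            # environmental change: instantiate a new environment model
            self.means.append(x.copy())
            self.counts.append(1)
        else:                               # previously seen environment: retrieve and adapt the old model
            self.counts[k] += 1
            self.means[k] += (x - self.means[k]) / self.counts[k]
        return k


# Usage: a new cluster is opened automatically when the environment parameters shift.
lib = EnvironmentLibrary(dim=2)
for x in [[0.1, 0.0], [0.0, 0.1], [5.0, 5.1], [0.05, -0.02]]:
    print(lib.observe(x))   # prints 0, 0, 1, 0 with these default hyperparameters
```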