Continual Learning (CL) considers the problem of training an agent sequentially on a set of tasks while seeking to retain performance on all previous tasks. A key challenge in CL is catastrophic forgetting, which arises when performance on a previously mastered task degrades while learning a new task. While a variety of methods exist to combat forgetting, in some cases tasks are fundamentally incompatible with each other and thus cannot be learnt by a single policy. This can occur in reinforcement learning (RL), where an agent may be rewarded for achieving different goals from the same observation. In this paper we formalize this ``interference'' as distinct from the problem of forgetting. We show that existing CL methods based on single neural network predictors with shared replay buffers fail in the presence of interference. Instead, we propose a simple method, OWL, to address this challenge. OWL learns a factorized policy, using shared feature-extraction layers but separate heads, each specializing on a new task. The separate heads in OWL are used to prevent interference. At test time, we formulate policy selection as a multi-armed bandit problem, and show it is possible to select the best policy for an unknown task using feedback from the environment. The use of bandit algorithms allows the OWL agent to constructively re-use different continually learnt policies at different times during an episode. We show in multiple RL environments that existing replay-based CL methods fail, while OWL is able to achieve close to optimal performance when training sequentially.
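The factorized policy described above (a shared feature trunk with one head per task, new heads added as tasks arrive) can be sketched as follows. This is not the authors' implementation: the dimensions, the plain NumPy linear layers, and the `tanh` nonlinearity are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class MultiHeadPolicy:
    """Hedged sketch of an OWL-style factorized policy: shared
    feature-extraction layer, separate per-task output heads."""

    def __init__(self, obs_dim, feat_dim):
        self.feat_dim = feat_dim
        # Shared feature-extraction weights, reused across all tasks.
        self.W_shared = rng.normal(scale=0.1, size=(obs_dim, feat_dim))
        self.heads = []  # one linear head per task, grown sequentially

    def add_head(self, n_actions):
        # A fresh head specializes on the new task, so learning it
        # cannot interfere with the outputs of earlier heads.
        self.heads.append(rng.normal(scale=0.1, size=(self.feat_dim, n_actions)))

    def logits(self, obs, head_idx):
        feats = np.tanh(obs @ self.W_shared)  # shared representation
        return feats @ self.heads[head_idx]   # task-specific output

policy = MultiHeadPolicy(obs_dim=4, feat_dim=8)
policy.add_head(n_actions=2)  # task 1
policy.add_head(n_actions=2)  # task 2
obs = rng.normal(size=4)
action_for_task_0 = int(np.argmax(policy.logits(obs, 0)))
```

Because the heads are disjoint, the same observation can map to different actions under different heads, which is exactly the incompatible-goals situation the abstract calls interference.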
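The test-time policy selection can be framed as a multi-armed bandit over the learnt heads, with environment reward as feedback. A minimal sketch using UCB1 follows; the Bernoulli reward model is a toy stand-in for the environment, and OWL's actual bandit algorithm and feedback signal may differ in detail.

```python
import math
import random

random.seed(0)

def ucb1_select(counts, values, t):
    """Pick an untried arm first, else the arm maximizing the UCB1 score."""
    for k in range(len(counts)):
        if counts[k] == 0:
            return k
    return max(range(len(counts)),
               key=lambda k: values[k] + math.sqrt(2 * math.log(t) / counts[k]))

# Toy feedback: each "arm" is one continually learnt head; head 1 happens
# to be the right policy for the current (unknown) task.
true_mean = [0.2, 0.8]

counts = [0, 0]   # pulls per head
values = [0.0, 0.0]  # running mean reward per head
for t in range(1, 201):
    arm = ucb1_select(counts, values, t)
    reward = 1.0 if random.random() < true_mean[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean

# After enough feedback, the bandit concentrates pulls on the correct head.
```

Because the bandit re-evaluates at every step, the agent can also switch heads partway through an episode, which is how different continually learnt policies get re-used at different times.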