Continual Learning (CL) considers the problem of training an agent sequentially on a set of tasks while seeking to retain performance on all previous tasks. A key challenge in CL is catastrophic forgetting, which arises when performance on a previously mastered task degrades while learning a new task. While a variety of methods exist to combat forgetting, in some cases tasks are fundamentally incompatible with each other and thus cannot be learnt by a single policy. This can occur in reinforcement learning (RL), where an agent may be rewarded for achieving different goals from the same observation. In this paper we formalize this ``interference'' as distinct from the problem of forgetting. We show that existing CL methods based on single neural network predictors with shared replay buffers fail in the presence of interference. Instead, we propose a simple method, OWL, to address this challenge. OWL learns a factorized policy, using shared feature-extraction layers but separate heads, each specializing on a new task. The separate heads in OWL are used to prevent interference. At test time, we formulate policy selection as a multi-armed bandit problem, and show it is possible to select the best policy for an unknown task using feedback from the environment. The use of bandit algorithms allows the OWL agent to constructively re-use different continually learnt policies at different times during an episode. We show in multiple RL environments that existing replay-based CL methods fail, while OWL is able to achieve close to optimal performance when training sequentially.
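The factorized policy described above (a shared feature trunk with one head per task, new heads added as tasks arrive) can be sketched as follows. This is not the authors' implementation: the dimensions, the plain NumPy linear layers, and the `tanh` nonlinearity are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class MultiHeadPolicy:
    """Hedged sketch of an OWL-style factorized policy: shared
    feature-extraction layer, separate per-task output heads."""

    def __init__(self, obs_dim, feat_dim):
        self.feat_dim = feat_dim
        # Shared feature-extraction weights, reused across all tasks.
        self.W_shared = rng.normal(scale=0.1, size=(obs_dim, feat_dim))
        self.heads = []  # one linear head per task, grown sequentially

    def add_head(self, n_actions):
        # A fresh head specializes on the new task, so learning it
        # cannot interfere with the outputs of earlier heads.
        self.heads.append(rng.normal(scale=0.1, size=(self.feat_dim, n_actions)))

    def logits(self, obs, head_idx):
        feats = np.tanh(obs @ self.W_shared)  # shared representation
        return feats @ self.heads[head_idx]   # task-specific output

policy = MultiHeadPolicy(obs_dim=4, feat_dim=8)
policy.add_head(n_actions=2)  # task 1
policy.add_head(n_actions=2)  # task 2
obs = rng.normal(size=4)
action_for_task_0 = int(np.argmax(policy.logits(obs, 0)))
```

Because the heads are disjoint, the same observation can map to different actions under different heads, which is exactly the incompatible-goals situation the abstract calls interference.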
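The test-time policy selection can be framed as a multi-armed bandit over the learnt heads, with environment reward as feedback. A minimal sketch using UCB1 follows; the Bernoulli reward model is a toy stand-in for the environment, and OWL's actual bandit algorithm and feedback signal may differ in detail.

```python
import math
import random

random.seed(0)

def ucb1_select(counts, values, t):
    """Pick an untried arm first, else the arm maximizing the UCB1 score."""
    for k in range(len(counts)):
        if counts[k] == 0:
            return k
    return max(range(len(counts)),
               key=lambda k: values[k] + math.sqrt(2 * math.log(t) / counts[k]))

# Toy feedback: each "arm" is one continually learnt head; head 1 happens
# to be the right policy for the current (unknown) task.
true_mean = [0.2, 0.8]

counts = [0, 0]   # pulls per head
values = [0.0, 0.0]  # running mean reward per head
for t in range(1, 201):
    arm = ucb1_select(counts, values, t)
    reward = 1.0 if random.random() < true_mean[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean

# After enough feedback, the bandit concentrates pulls on the correct head.
```

Because the bandit re-evaluates at every step, the agent can also switch heads partway through an episode, which is how different continually learnt policies get re-used at different times.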