Large pre-trained language models help achieve state-of-the-art results on a variety of natural language processing (NLP) tasks, yet they still suffer from forgetting when incrementally learning a sequence of tasks. To alleviate this problem, recent works augment existing models with sparse experience replay and local adaptation, which yields satisfactory performance. However, in this paper we find that pre-trained language models such as BERT have a potential ability to learn sequentially, even without any sparse memory replay. To verify BERT's ability to retain old knowledge, we adopt and re-finetune single-layer probe networks while keeping the parameters of BERT fixed. We investigate the models on two types of NLP tasks: text classification and extractive question answering. Our experiments reveal that BERT can in fact produce high-quality representations for previously learned tasks over the long term, under extremely sparse replay or even no replay at all. We further introduce a series of novel methods to interpret the mechanism of forgetting and how memory rehearsal plays a significant role in task-incremental learning, bridging the gap between our new finding and previous studies on catastrophic forgetting.
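To make the probing setup concrete, the following is a minimal sketch of training a single-layer probe on top of a frozen BERT encoder; the model checkpoint, the use of the [CLS] representation, the label count, and the optimizer settings are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch (assumptions noted above): a single linear probe trained on
# representations from a frozen BERT encoder.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

# Freeze all BERT parameters so only the probe is (re-)fine-tuned.
for p in encoder.parameters():
    p.requires_grad = False

num_labels = 4  # e.g. a text-classification task; value chosen for illustration
probe = torch.nn.Linear(encoder.config.hidden_size, num_labels)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

def probe_step(texts, labels):
    """One training step: frozen BERT representations -> single-layer probe."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():  # encoder stays fixed
        hidden = encoder(**batch).last_hidden_state[:, 0]  # [CLS] representation
    logits = probe(hidden)
    loss = torch.nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only the probe's parameters are updated, any change in probing accuracy on earlier tasks can be attributed to the quality of the frozen representations rather than to drift in the encoder itself.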