Recent research has revealed that neural language models at scale suffer from poor temporal generalization, i.e., a language model pre-trained on static data from past years performs increasingly worse on emerging data over time. Existing methods mainly perform continual training to mitigate this misalignment. While effective to some extent, the problem remains far from solved on both language modeling and downstream tasks. In this paper, we empirically observe that temporal generalization is closely associated with lexical semantic change, one of the essential phenomena of natural languages. Based on this observation, we propose a simple yet effective lexical-level masking strategy to post-train a converged language model. Experiments on two pre-trained language models, two different classification tasks, and four benchmark datasets demonstrate the effectiveness of our proposed method over existing temporal adaptation methods, i.e., continual training with new data. Our code is available at \url{https://github.com/zhaochen0110/LMLM}.
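As a rough illustration of the idea (not the paper's actual implementation, which lives in the linked repository), the sketch below assumes a precomputed lexical-semantic-change score per word and masks such words at a higher rate when building masked-language-modeling examples for post-training; the names `lexical_masking`, `change_scores`, and `boost` are hypothetical.

```python
import random
from typing import Dict, List

MASK_TOKEN = "[MASK]"


def lexical_masking(
    tokens: List[str],
    change_scores: Dict[str, float],
    mask_prob: float = 0.15,
    boost: float = 3.0,
) -> List[str]:
    """Mask tokens for MLM post-training, biasing selection toward words
    whose (assumed precomputed) lexical-semantic-change score is high."""
    masked = []
    for tok in tokens:
        # Words whose meaning has drifted over time get a higher masking rate.
        changed = change_scores.get(tok.lower(), 0.0) > 0.5
        p = mask_prob * (boost if changed else 1.0)
        masked.append(MASK_TOKEN if random.random() < min(p, 1.0) else tok)
    return masked


if __name__ == "__main__":
    # Illustrative change scores; a real pipeline would estimate these
    # from diachronic corpora.
    scores = {"corona": 0.9, "remote": 0.7}
    sentence = "The corona outbreak forced remote work everywhere".split()
    random.seed(0)
    print(lexical_masking(sentence, scores))
```

The design choice illustrated here is simply that masking probability is a function of semantic-change evidence rather than uniform, so the post-trained model spends more of its objective on words whose usage has shifted.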