A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in classical NLP pipelines. In this paper, we propose a different explanation: MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics. To demonstrate this, we pre-train MLMs on sentences with randomly shuffled word order, and show that these models still achieve high accuracy after fine-tuning on many downstream tasks -- including on tasks specifically designed to be challenging for models that ignore word order. Our models perform surprisingly well according to some parametric syntactic probes, indicating possible deficiencies in how we test representations for syntactic information. Overall, our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
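To make the pre-training setup concrete, below is a minimal sketch (not the authors' code) of how a shuffled-word-order corpus can be constructed: each sentence keeps its bag of words but loses its original order before being fed to MLM pre-training. The function names, whitespace tokenization, and fixed seed are illustrative assumptions; the paper's actual pipeline may shuffle at a different granularity (e.g. subword tokens or n-grams).

```python
# Minimal sketch: shuffle word order within each sentence of a pre-training corpus.
# Assumptions (not from the paper): whitespace tokenization, per-corpus RNG seed.
import random


def shuffle_sentence(sentence: str, rng: random.Random) -> str:
    """Return the sentence with its words in a random permutation."""
    words = sentence.split()
    rng.shuffle(words)
    return " ".join(words)


def build_shuffled_corpus(sentences, seed=0):
    """Shuffle word order independently in every sentence of the corpus."""
    rng = random.Random(seed)
    return [shuffle_sentence(s, rng) for s in sentences]


if __name__ == "__main__":
    corpus = [
        "the cat sat on the mat",
        "masked language models exploit co-occurrence statistics",
    ]
    for original, shuffled in zip(corpus, build_shuffled_corpus(corpus)):
        print(f"{original!r} -> {shuffled!r}")
```

A corpus transformed this way preserves which words co-occur within a sentence while destroying syntactic structure, which is exactly the contrast the experiments rely on.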