We present a self-supervised learning framework, COCO-LM, that pretrains Language Models by COrrecting and COntrasting corrupted text sequences. Following ELECTRA-style pretraining, COCO-LM employs an auxiliary language model to corrupt text sequences, upon which it constructs two new tasks for pretraining the main model. The first, token-level task, Corrective Language Modeling, is to detect and correct tokens replaced by the auxiliary model, in order to better capture token-level semantics. The second, sequence-level task, Sequence Contrastive Learning, is to align text sequences originating from the same source input while ensuring uniformity in the representation space. Experiments on GLUE and SQuAD demonstrate that COCO-LM not only outperforms recent state-of-the-art pretrained models in accuracy, but also improves pretraining efficiency. It matches the MNLI accuracy of ELECTRA with 50% of its pretraining GPU hours. With the same number of pretraining steps as standard base/large-sized models, COCO-LM outperforms the previous best models by more than 1 point on the GLUE average score.
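To make the two objectives concrete, the following is a minimal PyTorch sketch of losses of the kind described above: a corrective language modeling loss that predicts the original token at every position of the corrupted sequence, and an InfoNCE-style sequence contrastive loss over paired sequence representations with in-batch negatives. The function names, the temperature value, and the use of a symmetric in-batch InfoNCE formulation are illustrative assumptions, not the authors' exact implementation.

```python
# Hedged sketch of the two COCO-LM-style pretraining losses; names and
# hyperparameters here are assumptions for illustration only.
import torch
import torch.nn.functional as F

def corrective_lm_loss(token_logits, original_ids):
    """Corrective Language Modeling (sketch): predict the original token at
    every position of the corrupted input, i.e., detect and correct tokens
    replaced by the auxiliary model in one step.
    token_logits: [batch, seq_len, vocab_size]; original_ids: [batch, seq_len]."""
    return F.cross_entropy(
        token_logits.view(-1, token_logits.size(-1)),
        original_ids.view(-1),
    )

def sequence_contrastive_loss(cls_a, cls_b, temperature=0.07):
    """Sequence Contrastive Learning (sketch): align representations of two
    sequences originating from the same source input, using other sequences
    in the batch as negatives (symmetric InfoNCE).
    cls_a, cls_b: [batch, hidden] sequence-level representations."""
    a = F.normalize(cls_a, dim=-1)
    b = F.normalize(cls_b, dim=-1)
    logits = a @ b.t() / temperature          # [batch, batch] cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    # Positive pairs lie on the diagonal; off-diagonal entries act as negatives,
    # which encourages alignment of positives and uniformity across the batch.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```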