While modern masked language models (LMs) are trained on ever larger corpora, we here explore the effects of down-scaling training to a modestly sized but representative, well-balanced, and publicly available English text source -- the British National Corpus. We show that pre-training on this carefully curated corpus can reach better performance than the original BERT model. We argue that this type of corpus has great potential as a language modeling benchmark. To showcase this potential, we present fair, reproducible, and data-efficient comparative studies of LMs, in which we evaluate several training objectives and model architectures and replicate previous empirical results in a systematic way. We propose an optimized LM architecture called LTG-BERT.