Masked language models conventionally use a masking rate of 15%, based on the belief that masking more would leave insufficient context to learn good representations, while masking less would make training too expensive. Surprisingly, we find that masking up to 40% of input tokens can outperform the 15% baseline, and that masking as much as 80% can preserve most of the performance, as measured by finetuning on downstream tasks. Increasing the masking rate has two distinct effects, which we investigate through careful ablations: (1) a larger proportion of input tokens are corrupted, reducing the available context and creating a harder task, and (2) the model makes more predictions, which benefits training. We observe that larger models, which have more capacity to tackle harder tasks, particularly favor higher masking rates. We also find that more sophisticated masking schemes such as span masking or PMI masking can benefit from higher masking rates, albeit to a smaller extent. Our results contribute to a better understanding of masked language modeling and shed light on more efficient language pre-training.
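To make the masking-rate knob concrete, below is a minimal sketch of token masking for masked language modeling with a configurable rate. It is not the authors' implementation: the function name `mask_tokens`, the BERT-style special-token IDs, and the omission of the usual 80/10/10 mask/random/keep corruption scheme are all simplifying assumptions for illustration.

```python
import torch


def mask_tokens(input_ids: torch.Tensor,
                mask_token_id: int,
                mask_rate: float = 0.40,
                special_token_ids: frozenset = frozenset()):
    """Select `mask_rate` of positions as prediction targets and replace
    them with the [MASK] token. (Illustrative sketch; real MLM pipelines
    often also substitute random tokens or keep the original token.)"""
    labels = input_ids.clone()

    # Draw a Bernoulli mask over all positions at the chosen masking rate.
    probs = torch.full(input_ids.shape, mask_rate)
    masked = torch.bernoulli(probs).bool()

    # Never mask special tokens such as [CLS], [SEP], or padding.
    for tok in special_token_ids:
        masked &= input_ids != tok

    # Positions that are not predicted get the ignore label -100.
    labels[~masked] = -100

    corrupted = input_ids.clone()
    corrupted[masked] = mask_token_id
    return corrupted, labels


# Example usage with hypothetical BERT-style IDs ([CLS]=101, [SEP]=102, [MASK]=103):
ids = torch.tensor([[101, 2023, 2003, 1037, 7099, 6251, 102]])
corrupted, labels = mask_tokens(ids, mask_token_id=103, mask_rate=0.40,
                                special_token_ids=frozenset({0, 101, 102}))
```

Changing `mask_rate` from the conventional 0.15 to 0.40 simultaneously corrupts more of the context and increases the number of prediction targets per sequence, the two effects the ablations above disentangle.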