Masked language models conventionally use a masking rate of 15% due to the belief that more masking would provide insufficient context to learn good representations, and less masking would make training too expensive. Surprisingly, we find that masking up to 40% of input tokens can outperform the 15% baseline, and even masking 80% can preserve most of the performance, as measured by fine-tuning on downstream tasks. Increasing the masking rate has two distinct effects, which we investigate through careful ablations: (1) a larger proportion of input tokens are corrupted, reducing the context size and creating a harder task, and (2) models perform more predictions, which benefits training. We observe that larger models in particular favor higher masking rates, as they have more capacity to perform the harder task. We also connect our findings to sophisticated masking schemes such as span masking and PMI masking, as well as BERT's curious 80-10-10 corruption strategy, and find that simple uniform masking with [MASK] replacements can be competitive at higher masking rates. Our results contribute to a better understanding of masked language modeling and point to new avenues for efficient pre-training.
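To make the two corruption schemes discussed above concrete, the sketch below contrasts simple uniform masking (every selected position becomes [MASK]) with BERT's 80-10-10 strategy (80% [MASK], 10% random token, 10% unchanged), at a configurable masking rate. This is an illustrative sketch only; the function name `mask_tokens` and the constants `MASK_ID` and `VOCAB_SIZE` are assumed placeholders, not identifiers from the paper or any particular library.

```python
import random

MASK_ID = 103          # assumed [MASK] token id (BERT-style vocabulary)
VOCAB_SIZE = 30522     # assumed vocabulary size

def mask_tokens(token_ids, mask_rate=0.4, use_80_10_10=False, rng=None):
    """Corrupt a token sequence for masked language modeling.

    Returns (corrupted_ids, labels), where labels hold the original id at
    corrupted positions and -100 (an ignore index) everywhere else.
    """
    rng = rng or random.Random()
    corrupted = list(token_ids)
    labels = [-100] * len(token_ids)

    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_rate:
            continue                      # position not selected for prediction
        labels[i] = tok                   # model must predict the original token here
        if not use_80_10_10:
            corrupted[i] = MASK_ID        # simple uniform masking: always [MASK]
        else:
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK_ID    # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
            # remaining 10%: keep the original token unchanged

    return corrupted, labels

# Example: mask roughly 40% of a toy sequence with plain [MASK] replacement.
ids = [2023, 2003, 1037, 7099, 6251, 2005, 19081, 1012]
corrupted, labels = mask_tokens(ids, mask_rate=0.4, rng=random.Random(0))
print(corrupted)
print(labels)
```

Raising `mask_rate` from 0.15 toward 0.4 corresponds to the two effects ablated in the paper: more context is corrupted, and more positions contribute prediction losses per sequence.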