Masked language modeling (MLM) has been widely used for pre-training effective bidirectional representations, but it incurs substantial training costs. In this paper, we propose a novel concept-based curriculum masking (CCM) method to efficiently pre-train a language model. CCM differs from existing curriculum learning approaches in two key ways, so as to effectively reflect the nature of MLM. First, we introduce a carefully designed linguistic difficulty criterion that evaluates the MLM difficulty of each token. Second, we construct a curriculum that gradually masks words related to the previously masked words, retrieved from a knowledge graph. Experimental results show that CCM significantly improves pre-training efficiency. Specifically, the model trained with CCM achieves performance comparable to the original BERT on the General Language Understanding Evaluation benchmark at half of the training cost.
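The abstract only outlines the two ingredients of CCM, so the following minimal Python sketch illustrates the general idea rather than the paper's actual algorithm: the difficulty function, toy knowledge graph, stage schedule, and all names are hypothetical stand-ins. It orders seed concepts by an easy-to-hard criterion and, at each curriculum stage, expands the set of maskable words with knowledge-graph neighbors of the previously maskable ones.

```python
# Hypothetical sketch of concept-based curriculum masking (CCM).
# The difficulty score, knowledge-graph edges, and stage schedule below
# are illustrative assumptions, not the paper's actual criterion or data.
import random

# Toy knowledge graph: concept -> related concepts.
KG = {
    "dog": ["animal", "pet"],
    "animal": ["organism"],
    "pet": ["owner"],
}

def difficulty(concept):
    # Placeholder difficulty criterion: longer concepts are treated as
    # harder to predict under MLM (a stand-in for the paper's criterion).
    return len(concept)

def build_curriculum(seed_concepts, num_stages):
    """Each stage adds KG neighbors of the previously maskable concepts."""
    ordered = sorted(seed_concepts, key=difficulty)  # easy -> hard seeds
    stages, maskable = [], set()
    for stage in range(num_stages):
        # Admit the next slice of seeds, then expand with related concepts.
        maskable.update(ordered[: (stage + 1) * len(ordered) // num_stages])
        maskable.update(n for c in list(maskable) for n in KG.get(c, []))
        stages.append(set(maskable))
    return stages

def mask_tokens(tokens, maskable, mask_prob=0.15, mask_token="[MASK]"):
    """Mask only tokens whose concept is allowed at the current stage."""
    return [mask_token if t in maskable and random.random() < mask_prob else t
            for t in tokens]

stages = build_curriculum(["dog", "pet", "animal"], num_stages=3)
print(mask_tokens("the dog is a friendly pet animal".split(), stages[0], mask_prob=1.0))
```

In this sketch the curriculum is a list of growing concept sets, and at training time only tokens belonging to the current stage's set are eligible for masking; later stages progressively unlock harder, more distantly related concepts.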