Despite achieving impressive results on many NLP tasks, BERT-like masked language models (MLMs) suffer from a discrepancy between pre-training and inference. In light of this gap, we investigate the contextual representations of pre-training and inference from the perspective of word probability distributions. We find that BERT risks neglecting contextual word similarity during pre-training. To tackle this issue, we propose an auxiliary gloss regularizer module for BERT pre-training (GR-BERT) to enhance word semantic similarity. By simultaneously predicting masked words and aligning their contextual embeddings to the corresponding glosses, word similarity can be explicitly modeled. We design two architectures for GR-BERT and evaluate our model on downstream tasks. Experimental results show that the gloss regularizer benefits BERT in both word-level and sentence-level semantic representation. GR-BERT achieves a new state of the art on the lexical substitution task and substantially improves BERT sentence representations on both unsupervised and supervised STS tasks.
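The abstract describes combining the standard masked-word prediction objective with a gloss-alignment term. The following is a minimal sketch of how such a joint objective could look; the gloss encoder output, the `align_head` projection, and the weight `lambda_gr` are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch (assumptions, not the authors' code) of an MLM loss combined
# with a gloss regularizer that pulls each masked word's contextual embedding
# toward the embedding of its gloss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlossRegularizedMLMLoss(nn.Module):
    def __init__(self, hidden_size: int, vocab_size: int, lambda_gr: float = 1.0):
        super().__init__()
        self.mlm_head = nn.Linear(hidden_size, vocab_size)    # predicts masked tokens
        self.align_head = nn.Linear(hidden_size, hidden_size)  # maps context vector to gloss space
        self.lambda_gr = lambda_gr  # assumed weight balancing the two terms

    def forward(self, contextual_embeds, gloss_embeds, mlm_labels):
        # contextual_embeds: (batch, seq, hidden) from the BERT encoder
        # gloss_embeds:      (batch, seq, hidden) gloss vectors for masked positions
        # mlm_labels:        (batch, seq), -100 at non-masked positions
        logits = self.mlm_head(contextual_embeds)
        mlm_loss = F.cross_entropy(
            logits.view(-1, logits.size(-1)), mlm_labels.view(-1), ignore_index=-100
        )

        mask = mlm_labels != -100
        projected = self.align_head(contextual_embeds)[mask]
        target = gloss_embeds[mask]
        # Gloss regularizer: 1 - cosine similarity between the projected
        # contextual embedding and the corresponding gloss embedding.
        gr_loss = (1.0 - F.cosine_similarity(projected, target, dim=-1)).mean()

        return mlm_loss + self.lambda_gr * gr_loss
```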