Pre-trained language models (PLMs) aim to learn universal language representations by conducting self-supervised training tasks on large-scale corpora. Since PLMs capture word semantics in different contexts, the quality of word representations highly depends on word frequency, which usually follows a heavy-tailed distribution in the pre-training corpus. Therefore, the embeddings of rare words in the tail are usually poorly optimized. In this work, we focus on enhancing language model pre-training by leveraging definitions of rare words in dictionaries (e.g., Wiktionary). To incorporate a rare word's definition as part of the input, we fetch its definition from the dictionary and append it to the end of the input text sequence. In addition to training with the masked language modeling objective, we propose two novel self-supervised pre-training tasks on word- and sentence-level alignment between the input text sequence and rare word definitions to enhance language model representations with dictionary knowledge. We evaluate the proposed Dict-BERT model on the language understanding benchmark GLUE and eight specialized domain benchmark datasets. Extensive experiments demonstrate that Dict-BERT can significantly improve the understanding of rare words and boost model performance on various NLP downstream tasks.
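A minimal sketch of the input-construction step described above (fetching a rare word's definition and appending it to the end of the input sequence), using the Hugging Face BERT tokenizer. The `rare_vocab` set and `wiktionary` mapping are hypothetical placeholders, and the exact segment layout is an assumption rather than the paper's specification.

```python
from transformers import BertTokenizer

# Hypothetical placeholders: a frequency-derived rare-word list and a
# word -> Wiktionary gloss lookup. Neither name comes from the paper.
rare_vocab = {"anemometer"}
wiktionary = {"anemometer": "an instrument for measuring wind speed"}

def append_definitions(text: str, tokenizer: BertTokenizer):
    """Append dictionary definitions of rare words found in `text`
    to the end of the input, as sketched in the abstract."""
    rare_words = [w.strip(".,") for w in text.lower().split()
                  if w.strip(".,") in rare_vocab]
    # Joining multiple glosses with "; " is an assumption about formatting.
    definitions = "; ".join(wiktionary[w] for w in rare_words)
    if definitions:
        # Encoded as a sentence pair: [CLS] text [SEP] definitions [SEP]
        return tokenizer(text, definitions, truncation=True)
    return tokenizer(text, truncation=True)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = append_definitions("The anemometer recorded gusts of 40 mph.", tokenizer)
print(tokenizer.decode(encoded["input_ids"]))
```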