Pretrained language models (PLMs) have achieved remarkable improvements across various NLP tasks. Most Chinese PLMs simply treat an input text as a sequence of characters and completely ignore word information. Although Whole Word Masking can alleviate this, word-level semantics are still not well represented. In this paper, we revisit the segmentation granularity of Chinese PLMs. We propose a mixed-granularity Chinese BERT (MigBERT) that considers both characters and words. To achieve this, we design objective functions for learning both character- and word-level representations. We conduct extensive experiments on various Chinese NLP tasks to evaluate existing PLMs as well as the proposed MigBERT. Experimental results show that MigBERT achieves new state-of-the-art (SOTA) performance on all these tasks. Further analysis demonstrates that words are semantically richer than characters. More interestingly, we show that MigBERT also works with Japanese. Our code and model have been released here~\footnote{https://github.com/xnliang98/MigBERT}.