Pretrained language models (PLMs) have achieved remarkable improvements across various NLP tasks. Most Chinese PLMs simply treat an input text as a sequence of characters and completely ignore word information. Although Whole Word Masking can alleviate this issue, the semantics of words are still not well represented. In this paper, we revisit the segmentation granularity of Chinese PLMs. We propose a mixed-granularity Chinese BERT (MigBERT) that considers both characters and words. To achieve this, we design objective functions for learning both character- and word-level representations. We conduct extensive experiments on various Chinese NLP tasks to evaluate existing PLMs as well as the proposed MigBERT. Experimental results show that MigBERT achieves new state-of-the-art (SOTA) performance on all of these tasks. Further analysis demonstrates that words are semantically richer than characters. More interestingly, we show that MigBERT also works for Japanese. Our code has been released~\footnote{\url{https://github.com/xnliang98/MigBERT}}, and our model can be downloaded~\footnote{\url{https://huggingface.co/xnliang/MigBERT-large/}}.