Pre-trained language models have achieved great success in various natural language understanding (NLU) tasks due to their capacity to capture deep contextualized information in text by pre-training on large-scale corpora. One of the fundamental components of a pre-trained language model is its vocabulary, especially when training multilingual models that cover many different languages. In this technical report, we present our practice of training multilingual pre-trained language models with BBPE: Byte-Level BPE (i.e., Byte Pair Encoding). In our experiments, we adopt the NEZHA architecture as the underlying pre-trained language model, and the results show that NEZHA trained with byte-level subwords consistently outperforms Google multilingual BERT and vanilla NEZHA by a notable margin on several multilingual NLU tasks. We release the source code of our byte-level vocabulary building tools and the multilingual pre-trained language models.
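To make the byte-level idea concrete, the following is a minimal sketch (not the released tool) of how a BBPE vocabulary can be built: words are first decomposed into raw UTF-8 bytes, so every script shares the same 256-symbol base alphabet, and the most frequent adjacent symbol pairs are then merged greedily as in standard BPE. The toy corpus, merge count, and helper names (learn_bbpe, merge_pair, get_pair_counts) are illustrative assumptions, not the report's actual implementation.

    # Minimal byte-level BPE (BBPE) sketch, assuming a toy whitespace-tokenized corpus.
    from collections import Counter

    def get_pair_counts(sequences):
        """Count adjacent symbol pairs across all byte sequences."""
        counts = Counter()
        for seq, freq in sequences.items():
            for a, b in zip(seq, seq[1:]):
                counts[(a, b)] += freq
        return counts

    def merge_pair(sequences, pair):
        """Replace every occurrence of `pair` with a single merged symbol."""
        a, b = pair
        new_symbol = a + b
        merged = {}
        for seq, freq in sequences.items():
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(new_symbol)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            key = tuple(out)
            merged[key] = merged.get(key, 0) + freq
        return merged

    def learn_bbpe(corpus, num_merges):
        """Learn BBPE merges: words are mapped to raw UTF-8 bytes first,
        so any language shares the same 256-symbol base vocabulary."""
        word_freqs = Counter(corpus.split())
        sequences = {tuple(bytes([b]) for b in w.encode("utf-8")): f
                     for w, f in word_freqs.items()}
        merges = []
        for _ in range(num_merges):
            counts = get_pair_counts(sequences)
            if not counts:
                break
            best = counts.most_common(1)[0][0]
            sequences = merge_pair(sequences, best)
            merges.append(best)
        return merges

    if __name__ == "__main__":
        # Mixed Chinese/English toy corpus; frequent byte pairs are merged first.
        print(learn_bbpe("低资源 语言 low resource language language", num_merges=10))

Because the base symbols are bytes rather than characters, the vocabulary never produces out-of-vocabulary tokens for unseen scripts, which is the property that makes byte-level subwords attractive for multilingual pre-training.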