Compared to monolingual models, cross-lingual models usually require a more expressive vocabulary to represent all languages adequately. We find that many languages are under-represented in recent cross-lingual language models due to limited vocabulary capacity. To this end, we propose VoCap, an algorithm that determines the desired vocabulary capacity of each language. However, increasing the vocabulary size significantly slows down pre-training. To address this issue, we propose k-NN-based target sampling to accelerate the expensive softmax. Our experiments show that the multilingual vocabulary learned with VoCap benefits cross-lingual language model pre-training. Moreover, k-NN-based target sampling mitigates the side effects of a larger vocabulary while achieving comparable performance and faster pre-training speed. The code and the pre-trained multilingual vocabularies are available at https://github.com/bozheng-hit/VoCapXLM.
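To make the capacity-allocation idea concrete, here is a minimal sketch of one plausible greedy scheme, assuming each language has a utility curve (e.g., the average log probability of its text under monolingual subword models of increasing size) and the next slice of vocabulary capacity always goes to the language with the largest marginal gain. The names `allocate_vocab_capacity` and `alp` are illustrative assumptions, not the API of the released code.

```python
from typing import Dict, List

def allocate_vocab_capacity(
    alp: Dict[str, List[float]],  # alp[lang][i]: utility (e.g. ALP) after i slices
    total_slices: int,
) -> Dict[str, int]:
    """Greedy capacity allocation: repeatedly give the next slice of
    vocabulary capacity to the language with the largest marginal gain."""
    slices = {lang: 0 for lang in alp}

    def marginal_gain(lang: str) -> float:
        i = slices[lang]
        curve = alp[lang]
        # No gain once a language has no larger vocabulary left to use.
        return curve[i + 1] - curve[i] if i + 1 < len(curve) else float("-inf")

    for _ in range(total_slices):
        best = max(slices, key=marginal_gain)
        if marginal_gain(best) == float("-inf"):
            break  # every language's curve is exhausted
        slices[best] += 1
    return slices

# Toy usage: a language with a steep curve receives capacity before one
# whose curve has already flattened.
print(allocate_vocab_capacity(
    {"sw": [-5.0, -3.0, -2.5, -2.4], "en": [-2.0, -1.9, -1.85, -1.84]},
    total_slices=4,
))  # -> {'sw': 3, 'en': 1}
```

Under this reading, a larger total budget lets under-represented languages keep receiving capacity until their marginal gains fall below those of high-resource languages.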
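Likewise, a minimal sketch of k-NN-based target sampling, under the assumption that the softmax is restricted to the batch's gold targets plus their precomputed nearest neighbours in output-embedding space; `build_knn_index` and `sampled_softmax_loss` are hypothetical names, and a real large-vocabulary implementation would replace the brute-force neighbour search with an approximate index.

```python
import torch
import torch.nn.functional as F

def build_knn_index(output_embed: torch.Tensor, k: int) -> torch.Tensor:
    """For every vocabulary item, precompute the ids of its k nearest
    neighbours in output-embedding space (brute force for clarity)."""
    emb = F.normalize(output_embed, dim=-1)
    sim = emb @ emb.t()                              # (V, V) cosine similarities
    return sim.topk(k + 1, dim=-1).indices[:, 1:]    # drop the self-match

def sampled_softmax_loss(hidden, targets, output_embed, knn_ids):
    """Cross-entropy over a reduced candidate set: the batch's gold targets
    plus their k nearest neighbours, instead of the full vocabulary."""
    candidates = torch.unique(                       # sorted (C,) candidate ids
        torch.cat([targets, knn_ids[targets].reshape(-1)])
    )
    logits = hidden @ output_embed[candidates].t()   # (B, C) instead of (B, V)
    remap = torch.searchsorted(candidates, targets)  # gold ids -> positions in C
    return F.cross_entropy(logits, remap)

# Toy usage with a small vocabulary and hidden size.
V, d, B, k = 1000, 64, 8, 16
output_embed = torch.randn(V, d)
knn_ids = build_knn_index(output_embed, k)
loss = sampled_softmax_loss(
    torch.randn(B, d), torch.randint(V, (B,)), output_embed, knn_ids
)
```

Because the logits are computed over C ≪ V candidates, the softmax cost scales with the sampled set rather than the full vocabulary, which is what would let a larger vocabulary keep pre-training speed comparable.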