Most Transformer language models are pretrained primarily on English text, limiting their use for other languages. As model sizes grow, the performance gap between English and languages with fewer compute and data resources widens further. Consequently, more resource-efficient training methods are needed to bridge this gap for less-resourced languages. To address this problem, we introduce a cross-lingual and progressive transfer learning approach, called CLP-Transfer, that transfers models from a source language for which pretrained models are publicly available, such as English, to a new target language. As opposed to prior work, which focused on cross-lingual transfer between two languages, we extend the transfer along a second dimension: model size. Given a pretrained model in the source language, we aim for a same-sized model in the target language. Instead of training the target model from scratch, we exploit a smaller model in the target language that requires far fewer resources. Both the small target-language model and the source-language model are then used to initialize the token embeddings of the larger target model, based on the overlapping vocabulary of the source and target languages. All remaining weights are reused from the source-language model. This approach outperforms cross-lingual transfer alone and can save up to 80% of the training steps compared to random initialization.
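To make the embedding-initialization step concrete, the following is a minimal NumPy sketch of one plausible reading of the abstract, not the authors' reference implementation. The helper name `clp_transfer_embeddings` and the similarity-weighting scheme for target-only tokens are assumptions introduced here for illustration; the abstract only states that both the small target-language model and the source-language model contribute to the initialization via the overlapping vocabulary.

```python
import numpy as np

def clp_transfer_embeddings(src_emb, src_vocab, small_tgt_emb, tgt_vocab):
    """Hypothetical sketch: initialize the token embeddings of a large
    target-language model from (a) the large source-language model and
    (b) a small target-language model.

    src_emb:       (|V_src|, d_large) embeddings of the large source-language model
    small_tgt_emb: (|V_tgt|, d_small) embeddings of the small target-language model
    src_vocab / tgt_vocab: dicts mapping token string -> row index
    """
    d_large = src_emb.shape[1]
    tgt_emb = np.empty((len(tgt_vocab), d_large), dtype=src_emb.dtype)

    # Tokens shared by the source and target vocabularies.
    overlap = [t for t in tgt_vocab if t in src_vocab]
    ov_tgt = np.array([tgt_vocab[t] for t in overlap])
    ov_src = np.array([src_vocab[t] for t in overlap])

    # 1) Overlapping tokens: copy the source-model embedding directly.
    tgt_emb[ov_tgt] = src_emb[ov_src]

    # 2) Target-only tokens: combine the overlapping tokens' source embeddings,
    #    weighted by similarity in the *small* target model's embedding space
    #    (assumed weighting scheme, chosen for illustration).
    ov_small = small_tgt_emb[ov_tgt]                 # (|overlap|, d_small)
    for tok, idx in tgt_vocab.items():
        if tok in src_vocab:
            continue
        sims = small_tgt_emb[idx] @ ov_small.T       # similarity to overlap tokens
        w = np.exp(sims - sims.max())
        w /= w.sum()                                 # softmax weights
        tgt_emb[idx] = w @ src_emb[ov_src]           # weighted mix in d_large space
    return tgt_emb
```

All non-embedding weights of the larger target model would then simply be copied from the same-sized source-language model, so only the embedding matrix needs language-specific initialization before continued pretraining on target-language text.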