Recently, large pretrained language models (LMs) have gained popularity. Training these models requires ever more computational resources, and most of the existing models are trained on English text only. It is exceedingly expensive to train these models in other languages. To alleviate this problem, we introduce a method -- called WECHSEL -- to transfer English models to new languages. We replace the tokenizer of the English model with a tokenizer in the target language and initialize token embeddings so that they are close to semantically similar English tokens, using multilingual static word embeddings that cover English and the target language. We use WECHSEL to transfer GPT-2 and RoBERTa models to four other languages (French, German, Chinese and Swahili). WECHSEL improves over a previously proposed method for cross-lingual parameter transfer and outperforms models of comparable size trained from scratch in the target language with up to 64x less training effort. Our method makes training large language models for new languages more accessible and less damaging to the environment. We make our code and models publicly available.
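To make the embedding-initialization step more concrete, the following minimal NumPy sketch illustrates one plausible way to initialize target-language token embeddings from semantically similar English tokens via a shared static embedding space. The array names, the k-nearest-neighbour selection and the softmax weighting are illustrative assumptions, not the exact WECHSEL procedure, which additionally handles subword-to-word decomposition and the tokenizer swap.

```python
# Illustrative sketch only: hypothetical helper for WECHSEL-style initialization.
import numpy as np

def initialize_target_embeddings(
    source_embeddings: np.ndarray,   # (|V_src|, d) pretrained English token embeddings
    src_static: np.ndarray,          # (|V_src|, d_static) English tokens in the aligned static space
    tgt_static: np.ndarray,          # (|V_tgt|, d_static) target tokens in the same static space
    k: int = 10,
    temperature: float = 0.1,
) -> np.ndarray:
    """Initialize each target token embedding as a similarity-weighted average
    of the pretrained embeddings of its k closest English tokens."""
    # Cosine similarities between every target token and every source token
    # in the shared multilingual static embedding space.
    src_norm = src_static / np.linalg.norm(src_static, axis=1, keepdims=True)
    tgt_norm = tgt_static / np.linalg.norm(tgt_static, axis=1, keepdims=True)
    sims = tgt_norm @ src_norm.T                      # (|V_tgt|, |V_src|)

    target_embeddings = np.zeros((tgt_static.shape[0], source_embeddings.shape[1]))
    for i in range(tgt_static.shape[0]):
        nearest = np.argsort(-sims[i])[:k]            # indices of the k most similar English tokens
        weights = np.exp(sims[i, nearest] / temperature)
        weights /= weights.sum()                      # softmax over the k similarities (assumed weighting)
        # Weighted average of the corresponding pretrained English embeddings.
        target_embeddings[i] = weights @ source_embeddings[nearest]
    return target_embeddings
```

Under these assumptions, the returned matrix can serve as the input (and, for tied weights, output) embedding layer of the transferred model before continued pretraining on target-language text.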