Large generative language models have been very successful for English, but other languages lag behind, in part due to data and computational limitations. We propose a method that may overcome these problems by adapting existing pre-trained models to new languages. Specifically, we describe the adaptation of English GPT-2 to Italian and Dutch by retraining the lexical embeddings without tuning the Transformer layers. As a result, we obtain lexical embeddings for Italian and Dutch that are aligned with the original English lexical embeddings. Additionally, we scale up complexity by transforming the relearned lexical embeddings of GPT-2 small into the GPT-2 medium embedding space. This method minimises the amount of training and avoids losing the information that GPT-2 acquired during pre-training. English GPT-2 models with relearned lexical embeddings can generate realistic sentences in Italian and Dutch. Though on average these sentences are still identifiable as artificial by human judges, they are rated on par with sentences generated by a GPT-2 model fully trained from scratch.
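The core recipe is to freeze every Transformer layer of the English model and retrain only the lexical (token) embeddings on target-language text. The following is a minimal sketch of that setup, assuming the HuggingFace Transformers library; the Italian tokenizer path is hypothetical and not from the paper.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2")

# Hypothetical BPE tokenizer trained separately on Italian text.
it_tokenizer = GPT2TokenizerFast.from_pretrained("path/to/italian-bpe")

# Size the embedding matrix to the new vocabulary and reinitialise it,
# discarding the English token embeddings.
model.resize_token_embeddings(len(it_tokenizer))
with torch.no_grad():
    model.get_input_embeddings().weight.normal_(mean=0.0, std=0.02)

# Freeze all parameters, then unfreeze only the lexical embeddings.
# GPT-2 ties input and output embeddings, so the same matrix also
# serves as the (retrainable) output projection.
for param in model.parameters():
    param.requires_grad = False
model.get_input_embeddings().weight.requires_grad = True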
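For the scaling step, the abstract states that the relearned small embeddings are transformed into the GPT-2 medium embedding space. One plausible reading, sketched below under the assumption of a least-squares linear map (the exact transformation is not specified here), exploits the fact that the English small and medium models share a vocabulary: fit a matrix W mapping the English 768-dimensional small embeddings onto the 1024-dimensional medium embeddings, then apply W to the relearned Italian or Dutch embeddings.

```python
import torch
from transformers import GPT2LMHeadModel

# English embedding matrices with a shared vocabulary of 50,257 tokens.
emb_small = GPT2LMHeadModel.from_pretrained("gpt2").get_input_embeddings().weight.detach()          # (50257, 768)
emb_medium = GPT2LMHeadModel.from_pretrained("gpt2-medium").get_input_embeddings().weight.detach()  # (50257, 1024)

# Least-squares solution of emb_small @ W ≈ emb_medium.
W = torch.linalg.lstsq(emb_small, emb_medium).solution  # (768, 1024)

# relearned_small would be the (vocab, 768) Italian or Dutch embeddings from
# the first step; mapping them with W gives embeddings aligned with the
# frozen Transformer layers of GPT-2 medium:
# relearned_medium = relearned_small @ W
```

Because only an embedding-space transformation is learned, the larger model inherits its Transformer layers from English GPT-2 medium unchanged, consistent with the goal of minimising training while preserving what was learned during pre-training.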