Few-shot Learning with Multilingual Language Models

Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li

From arXiv, 36 pages

Large-scale autoregressive language models such as GPT-3 are few-shot learners that can perform a wide range of language tasks without fine-tuning. While these models are known to be able to jointly represent many different languages, their training data is dominated by English, potentially limiting their cross-lingual generalization. In this work, we train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages, and study their few- and zero-shot learning capabilities across a wide range of tasks. Our largest model, with 7.5 billion parameters, sets a new state of the art in few-shot learning in more than 20 representative languages, outperforming GPT-3 of comparable size in multilingual commonsense reasoning (+7.4% absolute accuracy in 0-shot settings and +9.4% in 4-shot settings) and natural language inference (+5.4% in each of the 0-shot and 4-shot settings). On the FLORES-101 machine translation benchmark, our model outperforms GPT-3 on 171 out of 182 translation directions with 32 training examples, while surpassing the official supervised baseline in 45 directions. We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning on some tasks, while there is still room for improvement on surface form robustness and adaptation to tasks that do not have a natural cloze form. Finally, we evaluate our models on social value tasks such as hate speech detection in five languages and find that they have limitations similar to those of comparably sized GPT-3 models.
