As the capabilities of language models continue to advance, it is conceivable that a "one-size-fits-all" model will remain the main paradigm. For instance, given the vast number of languages worldwide, many of which are low-resource, the prevalent practice is to pretrain a single model on multiple languages. In this paper, we add to the growing body of evidence that challenges this practice, demonstrating that monolingual pretraining on the target language significantly improves models already extensively trained on diverse corpora. More specifically, we further pretrain GPT-J and LLaMA models on Portuguese texts using 3% or less of their original pretraining budget. Few-shot evaluations on Poeta, a suite of 14 Portuguese datasets, reveal that our models outperform English-centric and multilingual counterparts by a significant margin. Our best model, Sabiá-65B, performs on par with GPT-3.5-turbo. By evaluating on datasets originally conceived in the target language as well as translated ones, we study the contributions of language-specific pretraining in terms of 1) capturing linguistic nuances and structures inherent to the target language, and 2) enriching the model's knowledge about a domain or culture. Our results indicate that the majority of the benefits stem from the domain-specific knowledge acquired through monolingual pretraining.
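The continued monolingual pretraining described above can be approximated with standard tooling. The sketch below shows a minimal causal-language-modeling run with Hugging Face transformers; the base checkpoint, the portuguese_corpus.txt file, and all hyperparameters are illustrative placeholders and are not the exact Sabiá setup or its compute budget.

# Minimal sketch of continued pretraining on Portuguese text (assumed setup,
# not the authors' pipeline): further train an English-centric checkpoint
# with the standard causal-LM objective on a target-language corpus.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "EleutherAI/gpt-j-6B"  # placeholder base checkpoint
BLOCK_SIZE = 2048                   # context length used for packing

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Hypothetical Portuguese corpus: one document per line in a plain-text file.
raw = load_dataset("text", data_files={"train": "portuguese_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"])

def group_texts(examples):
    # Concatenate documents and split into fixed-size blocks (standard LM packing).
    concatenated = sum(examples["input_ids"], [])
    total = (len(concatenated) // BLOCK_SIZE) * BLOCK_SIZE
    blocks = [concatenated[i:i + BLOCK_SIZE] for i in range(0, total, BLOCK_SIZE)]
    return {"input_ids": blocks, "attention_mask": [[1] * BLOCK_SIZE for _ in blocks]}

tokenized = raw["train"].map(tokenize, batched=True, remove_columns=["text"])
lm_dataset = tokenized.map(group_texts, batched=True, remove_columns=tokenized.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="continued-pretraining-pt",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=32,
        learning_rate=1e-5,
        max_steps=10_000,  # illustrative small budget, a fraction of the original pretraining
        bf16=True,
        logging_steps=100,
        save_steps=1_000,
    ),
    train_dataset=lm_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

The resulting checkpoint can then be evaluated few-shot on the Poeta datasets in the same way as the base model, so any gains are attributable to the target-language pretraining rather than to a change in evaluation protocol.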