As the capabilities of language models continue to advance, it is conceivable that a "one-size-fits-all" model will remain the main paradigm. For instance, given the vast number of languages worldwide, many of which are low-resource, the prevalent practice is to pretrain a single model on multiple languages. In this paper, we add to the growing body of evidence that challenges this practice, demonstrating that monolingual pretraining on the target language significantly improves models already extensively trained on diverse corpora. More specifically, we further pretrain GPT-J and LLaMA models on Portuguese texts using 3% or less of their original pretraining budget. Few-shot evaluations on Poeta, a suite of 14 Portuguese datasets, reveal that our models outperform English-centric and multilingual counterparts by a significant margin. Our best model, Sabiá-65B, performs on par with GPT-3.5-turbo. By evaluating on datasets originally conceived in the target language as well as translated ones, we study the contributions of language-specific pretraining in terms of 1) capturing linguistic nuances and structures inherent to the target language, and 2) enriching the model's knowledge about a domain or culture. Our results indicate that the majority of the benefits stem from the domain-specific knowledge acquired through monolingual pretraining.
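To make the continued-pretraining setup concrete, the sketch below shows monolingual (Portuguese) continued pretraining of a causal language model using the HuggingFace Transformers and Datasets libraries. This is not the training pipeline used for the Sabiá models; the checkpoint name, corpus file, sequence length, and hyperparameters are illustrative assumptions only.

```python
# Minimal sketch of continued (monolingual) pretraining on Portuguese text.
# Assumes HuggingFace Transformers/Datasets; all file names and hyperparameters
# are placeholders, not the configuration reported in the paper.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_model = "EleutherAI/gpt-j-6B"     # one of the base checkpoints mentioned in the abstract
corpus_file = "portuguese_corpus.txt"  # hypothetical plain-text Portuguese corpus

tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # GPT-J defines no pad token; reuse EOS for batching
model = AutoModelForCausalLM.from_pretrained(base_model)

# Load the raw corpus and tokenize it for causal language modeling.
raw = load_dataset("text", data_files=corpus_file)["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="continued-pretraining-sketch",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=64,
        learning_rate=1e-5,
        max_steps=1_000,  # a small budget relative to the original pretraining
        bf16=True,
    ),
    train_dataset=tokenized,
    # mlm=False produces next-token (causal) language-modeling labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

At the 6B-65B parameter scales discussed in the paper, such a run would additionally require model parallelism or other memory-saving strategies, but the data flow of continued pretraining remains as sketched.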