Multilingual language models (MLMs) acquire valuable, generalizable linguistic information during pretraining and have advanced the state of the art when finetuned on task-specific data. So far, only ~28 out of ~2,000 African languages are covered in existing language models. We ameliorate this limitation by developing SERENGETI, a set of massively multilingual language models that cover 517 African languages and language varieties. We evaluate our novel models on eight natural language understanding tasks across 20 datasets, comparing them to four MLMs that each cover only a small number of African languages. SERENGETI outperforms the other models on 11 datasets across the eight tasks and achieves an average F-1 of 82.27. We also perform error analysis of our models' performance and show the influence of mutual intelligibility when the models are applied under zero-shot settings. We will publicly release our models for research.