African languages are spoken by over a billion people, but are underrepresented in NLP research and development. The challenges impeding progress include the limited availability of annotated datasets, as well as a lack of understanding of the settings where current methods are effective. In this paper, we make progress towards solutions for these challenges, focusing on the task of named entity recognition (NER). We create the largest human-annotated NER dataset for 20 African languages, and we study the behavior of state-of-the-art cross-lingual transfer methods in an Africa-centric setting, demonstrating that the choice of source language significantly affects performance. We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points across 20 languages compared to using English. Our results highlight the need for benchmark datasets and models that cover typologically-diverse African languages.