We take a step towards addressing the under-representation of the African continent in NLP research by creating the first large publicly available high-quality dataset for named entity recognition (NER) in ten African languages, bringing together a variety of stakeholders. We detail characteristics of the languages to help researchers understand the challenges that these languages pose for NER. We analyze our datasets and conduct an extensive empirical evaluation of state-of-the-art methods across both supervised and transfer learning settings. We release the data, code, and models in order to inspire future research on African NLP.