Language identification (LID) is a crucial precursor for NLP, especially for mining web data. Problematically, most of the world's 7000+ languages are not covered by today's LID technologies. We address this pressing issue for Africa by introducing AfroLID, a neural LID toolkit for 517 African languages and varieties. AfroLID exploits a multi-domain web dataset manually curated from across 14 language families utilizing five orthographic systems. When evaluated on our blind Test set, AfroLID achieves an $F_1$ score of 95.89. We also compare AfroLID to five existing LID tools that each cover a small number of African languages, finding that it outperforms them on most languages. We further show the utility of AfroLID in the wild by testing it on the acutely under-served Twitter domain. Finally, we offer a number of controlled case studies and perform a linguistically motivated error analysis that allow us to showcase both AfroLID's powerful capabilities and its limitations.