Language identification (LID) is a crucial precursor to NLP, especially for mining web data. Yet most of the world's $7000$+ languages are not covered by current LID technologies. We address this pressing issue for Africa by introducing~\ourLID, a neural LID toolkit for $517$ African languages and varieties. \ourLID~exploits a multi-domain web dataset, manually curated from $14$ language families that use five orthographic systems. When evaluated on our blind Test set, \ourLID~achieves an $F_1$ score of $95.89$. We also compare~\ourLID~to five existing LID tools, each of which covers a small number of African languages, and find that it outperforms them on most languages. We further show the utility of~\ourLID~in the wild by testing it on the acutely under-served Twitter domain. Finally, we offer a number of controlled case studies and perform a linguistically motivated error analysis, which together allow us to showcase both \ourLID's powerful capabilities and its limitations.