African languages have recently been the subject of several studies in Natural Language Processing (NLP), which has led to a significant increase in their representation in the field. However, most studies focus more on the models than on the quality of the datasets when assessing model performance in tasks such as Named Entity Recognition (NER). While this works well in most cases, it does not account for the limitations of doing NLP with low-resource languages, namely the quality and quantity of the data at our disposal. This paper analyses the performance of various models with respect to dataset quality. We evaluate several pre-trained models against the entity density per sentence of a number of African NER datasets. We hope this study will improve the way NLP research is conducted in the context of low-resource languages.
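As a rough illustration of the dataset statistic mentioned above, the sketch below shows one possible way to compute entity density per sentence from a CoNLL-style, BIO-tagged NER file (one token and tag per line, blank line between sentences). This is a minimal sketch under those format assumptions; the file path and the exact density definition (entity spans divided by sentence length) are illustrative and not necessarily those used in the study.

```python
def entity_density_per_sentence(conll_path):
    """Return one density value per sentence: number of entity spans
    (B- tags) divided by the number of tokens in that sentence.

    Assumes a CoNLL-style file: 'token ... tag' per line, blank lines
    separating sentences, BIO tagging scheme."""
    densities = []
    tokens, tags = [], []
    with open(conll_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # sentence boundary
                if tokens:
                    n_spans = sum(tag.startswith("B-") for tag in tags)
                    densities.append(n_spans / len(tokens))
                    tokens, tags = [], []
                continue
            parts = line.split()
            tokens.append(parts[0])
            tags.append(parts[-1])
    if tokens:  # handle a final sentence with no trailing blank line
        n_spans = sum(tag.startswith("B-") for tag in tags)
        densities.append(n_spans / len(tokens))
    return densities


# Hypothetical usage on one dataset split:
# densities = entity_density_per_sentence("train.txt")
# print(sum(densities) / len(densities))  # mean entity density
```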