Training data for machine learning models can come from many different sources, some of dubious quality. For resource-rich languages like English, so much data is available that we can afford to throw out the dubious data. For low-resource languages, where far less data is available, we cannot necessarily afford to throw out the dubious data, lest we end up with a training set that is too small to train a model. In this study, we examine the effects of text normalization and dataset quality for a set of low-resource languages of Africa: Afrikaans, Amharic, Hausa, Igbo, Malagasy, Somali, Swahili, and Zulu. We describe the text normalizer we built in the Pynini framework, a Python library for finite state transducers, and our experiments in training language models for African languages using the Natural Language Toolkit (NLTK), an open-source Python library for NLP.
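To illustrate the two toolkits named above, here is a minimal sketch, assuming Pynini and NLTK are installed. The single rewrite rule and the toy Swahili corpus are invented for illustration; they are not the paper's actual normalizer grammar or training data.

```python
# A minimal sketch (not the paper's normalizer or models): one Pynini
# rewrite rule for punctuation normalization, then a trigram MLE language
# model trained with NLTK. The example sentences are hypothetical.
import pynini
from pynini.lib import byte
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# sigma_star accepts any byte string; cdrewrite applies the rewrite
# (curly apostrophe -> ASCII apostrophe) everywhere, with empty contexts.
sigma_star = pynini.closure(byte.BYTE)
normalizer = pynini.cdrewrite(pynini.cross("’", "'"), "", "", sigma_star)

def normalize(text: str) -> str:
    """Compose the input with the rule and return the best output string."""
    return pynini.shortestpath(pynini.compose(text, normalizer)).string()

# Hypothetical tokenized Swahili sentences standing in for a cleaned corpus.
sentences = [normalize(s).split() for s in ("habari za asubuhi", "habari za jioni")]

order = 3
train, vocab = padded_everygram_pipeline(order, sentences)
lm = MLE(order)
lm.fit(train, vocab)
print(lm.score("za", ["habari"]))  # P(za | habari) = 1.0 on this toy corpus
```

In practice a normalizer would union many such rules (quote and punctuation variants, digit and whitespace cleanup, script-specific mappings) before composing them with the input, and the language models would be trained on full corpora rather than a two-sentence toy set.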