Language models are the foundation of current neural network-based approaches to natural language understanding and generation. However, research on the intrinsic performance of language models on African languages has been extremely limited, a challenge compounded by the lack of the large, standardised training and evaluation sets that exist for English and other high-resource languages. In this paper, we evaluate the performance of open-vocabulary language models on low-resource South African languages, using byte-pair encoding to handle the rich morphology of these languages. We evaluate different variants of n-gram models, feedforward neural networks, recurrent neural networks (RNNs), and Transformers on small-scale datasets. Overall, well-regularised RNNs give the best performance across two isiZulu datasets and one Sepedi dataset. Multilingual training further improves performance on these datasets. We hope that this work will open new avenues for research into multilingual and low-resource language modelling for African languages.
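To make the open-vocabulary setup concrete, below is a minimal, self-contained sketch of byte-pair encoding (BPE) merge learning, the subword technique the abstract names for handling rich morphology. The toy corpus, merge count, and end-of-word marker are illustrative assumptions, not the paper's actual tokenizer or data.

```python
# Minimal sketch of byte-pair encoding (BPE) merge learning.
# Toy corpus and merge count are hypothetical, for illustration only.
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge operations from a word-frequency dictionary."""
    # Represent each word as a tuple of symbols, starting from characters,
    # with an end-of-word marker so merges respect word boundaries.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the most frequent merge everywhere it occurs.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges, vocab

# Hypothetical agglutinative word forms standing in for real corpus data.
corpus = Counter("ngiyabonga ngiyakuthanda siyabonga".split())
merges, segmented = learn_bpe(corpus, num_merges=10)
print(merges)          # learned merge operations, most frequent first
print(list(segmented)) # words segmented into subword units
```

Because frequent character sequences (such as shared affixes) are merged into single units, BPE keeps the vocabulary closed and finite while still being able to represent any unseen word form, which is why it suits morphologically rich, low-resource languages.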