This paper introduces two multilingual, government-themed corpora covering South African languages. The corpora were collected from the South African Government newspaper (Vuk'uzenzele) and from South African government speeches (ZA-gov-multilingual), both of which are translated into all 11 official South African languages. The corpora can be used for a wide range of downstream NLP tasks. They were created to allow researchers to study the language used in South African government publications, with a focus on understanding how South African government officials communicate with their constituents. In this paper we describe the process of gathering, cleaning, and releasing the corpora. We create parallel sentence corpora for Neural Machine Translation (NMT) tasks by aligning sentences with Language-Agnostic Sentence Representations (LASER) embeddings. Using these aligned sentences, we then provide NMT benchmarks for 9 indigenous languages by fine-tuning a massively multilingual pre-trained language model.
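The abstract does not spell out how the LASER embeddings are turned into sentence pairs, so the following is only a minimal sketch of one common approach: score every source/target sentence pair by cosine similarity and greedily keep the best one-to-one matches above a threshold. The function names, the `threshold` value, and the toy three-dimensional vectors are all illustrative assumptions; real LASER embeddings are high-dimensional and would be computed by the LASER encoder beforehand.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def align_sentences(
    src_emb: Sequence[Sequence[float]],
    tgt_emb: Sequence[Sequence[float]],
    threshold: float = 0.8,  # hypothetical cut-off, not taken from the paper
) -> List[Tuple[int, int, float]]:
    """Greedy one-to-one alignment of sentences by embedding similarity.

    Returns (source_index, target_index, score) triples, sorted by source index.
    """
    # Score every candidate pair, best first.
    candidates = sorted(
        (
            (cosine(u, v), i, j)
            for i, u in enumerate(src_emb)
            for j, v in enumerate(tgt_emb)
        ),
        reverse=True,
    )
    used_src, used_tgt = set(), set()
    pairs = []
    for score, i, j in candidates:
        if score < threshold:
            break  # all remaining candidates score lower still
        if i in used_src or j in used_tgt:
            continue  # each sentence may be matched at most once
        used_src.add(i)
        used_tgt.add(j)
        pairs.append((i, j, score))
    return sorted(pairs)


# Toy stand-ins for precomputed sentence embeddings (assumed, not real LASER output).
src_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
tgt_embeddings = [[0.0, 0.9, 0.1], [0.95, 0.05, 0.0], [0.5, 0.5, 0.5]]

pairs = align_sentences(src_embeddings, tgt_embeddings)
for i, j, score in pairs:
    print(f"source {i} <-> target {j} (cosine {score:.3f})")
```

In this toy run the third source sentence stays unaligned because no target clears the threshold, which mirrors how low-confidence pairs would be dropped from a parallel corpus. Margin-based scoring is a frequently used alternative to a raw cosine threshold; which criterion the authors applied is not stated in the abstract.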