Millions of people around the world cannot access content on the Web because most of it is not available in their language. Machine translation (MT) systems have the potential to change this for many languages. Current MT systems provide very accurate results for high-resource language pairs, e.g., German and English. However, for many low-resource languages, MT is still under active research. The key challenge is the lack of datasets to build these systems. We present Lesan, an MT system for low-resource languages. Our pipeline solves the key bottleneck to low-resource MT by leveraging online and offline sources, a custom OCR system for Ethiopic, and an automatic alignment module. The final step in the pipeline is a sequence-to-sequence model that takes a parallel corpus as input and gives us a translation model. Lesan's translation model is based on the Transformer architecture. After constructing a base model, back translation is used to leverage monolingual corpora. Currently, Lesan supports translation to and from Tigrinya, Amharic, and English. We perform extensive human evaluation and show that Lesan outperforms state-of-the-art systems such as Google Translate and Microsoft Translator across all six pairs. Lesan is freely available and has served more than 10 million translations so far. At the moment, there are only 217 Tigrinya and 15,009 Amharic Wikipedia articles. We believe that Lesan will contribute towards democratizing access to the Web through MT for millions of people.
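The back translation step mentioned above can be illustrated with a minimal sketch. It is not Lesan's implementation: the function names, the toy lexicon, and the placeholder tokens are all hypothetical, and a trained target-to-source model would take the place of `toy_reverse_translate`. The idea shown is only the general technique: target-side monolingual sentences are translated back into the source language, and the resulting synthetic pairs are added to the real parallel corpus.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def back_translate(
    target_monolingual: List[str],
    reverse_translate: Callable[[str], str],
) -> List[Pair]:
    """Build synthetic (source, target) pairs from target-side monolingual text.

    `reverse_translate` is a target-to-source model (a hypothetical stand-in here).
    """
    synthetic: List[Pair] = []
    for target_sentence in target_monolingual:
        synthetic_source = reverse_translate(target_sentence)
        # Skip sentences for which the reverse model produced nothing.
        if synthetic_source.strip():
            synthetic.append((synthetic_source, target_sentence))
    return synthetic


def augment(parallel: List[Pair], synthetic: List[Pair]) -> List[Pair]:
    """Combine real and synthetic pairs, dropping exact duplicates in order."""
    seen = set()
    combined: List[Pair] = []
    for pair in parallel + synthetic:
        if pair not in seen:
            seen.add(pair)
            combined.append(pair)
    return combined


# Toy stand-in for a trained reverse model: word-by-word lookup that maps
# target words to made-up placeholder source tokens (not real translations).
TOY_LEXICON = {"hello": "src_hello", "world": "src_world"}


def toy_reverse_translate(sentence: str) -> str:
    words = sentence.lower().split()
    return " ".join(TOY_LEXICON[w] for w in words if w in TOY_LEXICON)


if __name__ == "__main__":
    real_pairs = [("src_hello", "hello")]
    monolingual_target = ["hello world", "unseen words only"]
    synthetic_pairs = back_translate(monolingual_target, toy_reverse_translate)
    for source, target in augment(real_pairs, synthetic_pairs):
        print(f"{source}\t{target}")
```

In this sketch the augmented corpus would then be used to retrain the source-to-target model; how the synthetic data is filtered or weighted in practice is a design choice the abstract does not specify.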