Lesan -- Machine Translation for Low Resource Languages

Millions of people around the world cannot access content on the Web because most of it is not readily available in their language. Machine translation (MT) systems have the potential to change this for many languages. Current MT systems provide very accurate results for high-resource language pairs, e.g., German and English. However, for many low-resource languages, MT is still under active research. The key challenge is the lack of datasets to build these systems. We present Lesan, an MT system for low-resource languages. Our pipeline solves the key bottleneck of low-resource MT by leveraging online and offline sources, a custom OCR system for Ethiopic, and an automatic alignment module. The final step in the pipeline is a sequence-to-sequence model that takes a parallel corpus as input and gives us a translation model. Lesan's translation model is based on the Transformer architecture. After constructing a base model, back translation is used to leverage monolingual corpora. Currently, Lesan supports translation to and from Tigrinya, Amharic, and English. We perform extensive human evaluation and show that Lesan outperforms state-of-the-art systems such as Google Translate and Microsoft Translator across all six pairs. Lesan is freely available and has served more than 10 million translations so far. At the moment, there are only 217 Tigrinya and 15,009 Amharic Wikipedia articles. We believe that Lesan will contribute towards democratizing access to the Web through MT for millions of people.
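The back-translation step mentioned in the abstract can be sketched as follows. This is a minimal illustration of the general technique, not Lesan's implementation: the function names `back_translate` and `augment` are hypothetical, and `toy_reverse_model` is a placeholder standing in for a trained target-to-source base model. The idea is that monolingual text in the target language is translated back into the source language by the base model, and the resulting synthetic pairs are added to the genuine parallel corpus before retraining the source-to-target model.

```python
from typing import Callable, List, Tuple

# A sentence pair is (source sentence, target sentence).
SentencePair = Tuple[str, str]


def back_translate(
    monolingual_target: List[str],
    target_to_source: Callable[[str], str],
) -> List[SentencePair]:
    """Build synthetic pairs from target-side monolingual sentences.

    The source side is machine-generated by the reverse model;
    the target side is the original human-written sentence.
    """
    return [(target_to_source(t), t) for t in monolingual_target]


def augment(
    parallel: List[SentencePair],
    monolingual_target: List[str],
    target_to_source: Callable[[str], str],
) -> List[SentencePair]:
    """Return the genuine parallel corpus followed by the synthetic pairs."""
    return list(parallel) + back_translate(monolingual_target, target_to_source)


def toy_reverse_model(sentence: str) -> str:
    """Hypothetical placeholder for a trained target-to-source model.

    It only tags each token with 'src:' so the data flow is visible;
    a real system would call the base translation model here.
    """
    return " ".join("src:" + word for word in sentence.split())


if __name__ == "__main__":
    genuine = [("src:hello", "hello")]
    monolingual = ["good morning", "thank you"]
    for source, target in augment(genuine, monolingual, toy_reverse_model):
        print(source, "->", target)
```

Keeping the human-written sentence on the target side is the usual design choice: the model being retrained learns to produce fluent target text even when its synthetic source input is noisy.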
