The performance of Neural Machine Translation (NMT) depends significantly on the size of the available parallel corpus. As a result, low-resource language pairs show lower translation performance than high-resource language pairs, and translation quality degrades further when NMT is applied to morphologically rich languages. Although the web contains a large amount of information, most people in Sri Lanka are unable to read and understand English properly, so there is a strong need to translate English content into local languages in order to share information among locals. Sinhala is the primary language of Sri Lanka, and building an NMT system that produces quality English-to-Sinhala translations is difficult because of the syntactic divergence between the two languages under low-resource constraints. In this research, we therefore explore effective methods of incorporating Part of Speech (POS) tags into the Transformer input embedding and positional encoding to further enhance the performance of the baseline English-to-Sinhala NMT model.
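As a rough illustration of what adding POS information to the Transformer input can look like, the following minimal sketch sums a token embedding, a POS-tag embedding, and the standard sinusoidal positional encoding. The abstract does not say how the tags are combined, so the summation, the function names (`sinusoidal_position`, `make_table`, `embed_with_pos_tags`), the toy vocabulary, and the tag set are all assumptions made for illustration and should not be read as the method used in this work.

```python
import math
import random


def sinusoidal_position(pos, d_model):
    """Standard sinusoidal positional encoding for a single position."""
    vec = []
    for i in range(d_model):
        angle = pos / (10000 ** (2 * (i // 2) / d_model))
        # Even dimensions use sine, odd dimensions use cosine.
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


def make_table(num_rows, d_model, rng):
    """Randomly initialised embedding table (stand-in for learned weights)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(num_rows)]


def embed_with_pos_tags(token_ids, tag_ids, token_table, tag_table):
    """Assumed combination: scaled token embedding + tag embedding + position."""
    d_model = len(token_table[0])
    scale = math.sqrt(d_model)
    out = []
    for position, (tok, tag) in enumerate(zip(token_ids, tag_ids)):
        pe = sinusoidal_position(position, d_model)
        out.append([
            token_table[tok][k] * scale + tag_table[tag][k] + pe[k]
            for k in range(d_model)
        ])
    return out


# Hypothetical toy vocabulary and tag set, for illustration only.
vocab = {"the": 0, "cat": 1, "sleeps": 2}
tags = {"DET": 0, "NOUN": 1, "VERB": 2}

rng = random.Random(0)
d_model = 8
token_table = make_table(len(vocab), d_model, rng)
tag_table = make_table(len(tags), d_model, rng)

token_ids = [vocab[w] for w in ["the", "cat", "sleeps"]]
tag_ids = [tags[t] for t in ["DET", "NOUN", "VERB"]]
encoder_input = embed_with_pos_tags(token_ids, tag_ids, token_table, tag_table)
```

In this sketch the same word receives a different input vector when its POS tag changes, which is one simple way syntactic information could reach the encoder. Concatenating the tag embedding instead of summing it, or injecting tag information into the positional term, would be alternative designs under the same assumptions.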