Neural machine translation (NMT) has recently gained widespread attention because of its high translation accuracy. However, it performs poorly on long sentences, which is a major issue for low-resource languages. This issue is assumed to be caused by an insufficient number of long sentences in the training data. Therefore, this study proposes a simple data augmentation method for handling long sentences. The method uses only the given parallel corpora as training data and generates long sentences by concatenating two sentences. Experimental results confirm that the proposed method improves long-sentence translation despite its simplicity. Moreover, translation quality improves further when the proposed method is combined with back-translation.
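The concatenation step described above can be illustrated with a minimal sketch. The abstract does not say how the two sentences are chosen or how they are joined, so the random pairing, the `sep` joiner, and the function name `concat_augment` are all assumptions made for illustration, not the authors' implementation.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def concat_augment(corpus: List[Pair], sep: str = " ", seed: int = 0) -> List[Pair]:
    """Return the original pairs plus one concatenated pair per original pair.

    Assumption: the second sentence is drawn uniformly at random from the
    same corpus, and source and target sides are joined in the same order.
    """
    rng = random.Random(seed)
    augmented = list(corpus)  # keep the original data unchanged
    for src_a, tgt_a in corpus:
        src_b, tgt_b = rng.choice(corpus)  # hypothetical pairing strategy
        augmented.append((src_a + sep + src_b, tgt_a + sep + tgt_b))
    return augmented
```

Calling `concat_augment(pairs)` on a list of sentence pairs returns a list twice as long, in which the second half contains the longer, concatenated examples. Whether a dedicated separator token helps is left open here; `sep` is only a placeholder for that choice.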