Large web-crawled corpora represent an excellent resource for improving the performance of Neural Machine Translation (NMT) systems across several language pairs. However, since these corpora are typically extremely noisy, their use is fairly limited. Current approaches to dealing with this problem mainly focus on filtering using heuristics or single features such as language model scores or bilingual similarity. This work presents an alternative approach which learns weights for multiple sentence-level features. These feature weights, which are optimized directly for the task of improving translation performance, are used to score and filter sentences in the noisy corpora more effectively. We provide results of applying this technique to building NMT systems using the Paracrawl corpus for Estonian-English and show that it beats strong single-feature baselines and hand-designed combinations. Additionally, we analyze the sensitivity of this method to different types of noise and explore whether the learned weights generalize to other language pairs using the Maltese-English Paracrawl corpus.
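To make the scoring-and-filtering idea concrete, the following is a minimal sketch of how a learned weight vector could rank sentence pairs by a weighted sum of their features and keep only the top-scoring fraction of the corpus. The feature set and weight values here are purely illustrative assumptions, not those used in the paper, and the function names are hypothetical.

```python
def score_and_filter(feature_rows, weights, keep_fraction=0.5):
    """Score each sentence pair as a weighted sum of its features and
    return the indices of the top-scoring fraction of the corpus.

    feature_rows: one list of feature values per sentence pair
    weights: learned feature weights (illustrative values below)
    """
    scores = [sum(w * f for w, f in zip(weights, row)) for row in feature_rows]
    n_keep = int(len(scores) * keep_fraction)
    # Rank sentence-pair indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n_keep], scores

# Toy corpus: 4 sentence pairs with 3 hypothetical features each,
# e.g. (language model score, length ratio, bilingual similarity).
features = [
    [0.9, 1.0, 0.8],
    [0.1, 0.2, 0.1],
    [0.7, 0.9, 0.6],
    [0.3, 0.5, 0.2],
]
weights = [0.5, 0.2, 0.3]  # in the paper these would be learned, not hand-set
kept, scores = score_and_filter(features, weights, keep_fraction=0.5)
# kept now holds the indices of the cleanest-looking half of the corpus
```

In the paper's setting, the weights themselves are optimized for downstream translation quality rather than fixed by hand; this sketch only shows the scoring-and-filtering step once such weights are available.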