We participated in the WMT 2022 Large-Scale Machine Translation Evaluation for the African Languages Shared Task. This work describes our approach, which is based on filtering the given noisy data with a sentence-pair classifier built by fine-tuning a pre-trained language model. To train the classifier, we obtain positive samples (i.e., high-quality parallel sentences) from a gold-standard curated dataset and extract negative samples (i.e., low-quality parallel sentences) from automatically aligned parallel data by choosing sentence pairs with low alignment scores. Our final machine translation model was then trained on the filtered data instead of the entire noisy dataset. We empirically validate our approach by evaluating on two common datasets and show that data filtering generally improves overall translation quality, in some cases even significantly.
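The filtering pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: all function names are hypothetical, and a crude length-ratio heuristic stands in for the fine-tuned language-model classifier purely so the sketch is runnable.

```python
def build_training_set(gold_pairs, aligned_pairs, align_threshold=0.5):
    """Assemble classifier training data: positives come from the
    gold-standard curated pairs; negatives come from automatically
    aligned pairs whose alignment score falls below the threshold."""
    positives = [(src, tgt, 1) for src, tgt in gold_pairs]
    negatives = [(src, tgt, 0)
                 for src, tgt, score in aligned_pairs
                 if score < align_threshold]
    return positives + negatives


def filter_corpus(noisy_pairs, classifier, keep_threshold=0.5):
    """Keep only the sentence pairs the classifier scores as high quality;
    the MT model is then trained on this filtered subset."""
    return [(src, tgt) for src, tgt in noisy_pairs
            if classifier(src, tgt) >= keep_threshold]


def toy_classifier(src, tgt):
    # Stand-in scorer (NOT the paper's method): a length-ratio heuristic
    # used here only to make the sketch self-contained and executable.
    return min(len(src), len(tgt)) / max(len(src), len(tgt), 1)


noisy = [("Hello world", "Bonjour le monde"), ("Hello", "x" * 80)]
filtered = filter_corpus(noisy, toy_classifier)
```

In practice the stand-in scorer would be replaced by the fine-tuned sentence-pair classifier, and the threshold tuned on held-out gold data.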