We introduce MTet, the largest publicly available parallel corpus for English-Vietnamese translation. MTet consists of 4.2M high-quality training sentence pairs and a multi-domain test set refined by the Vietnamese research community. Combined with prior work on English-Vietnamese translation, MTet expands the existing parallel data to 6.2M sentence pairs. We also release EnViT5, the first pretrained model for English and Vietnamese. Using both resources, our model outperforms previous state-of-the-art results by up to 2 BLEU points in translation while being 1.6 times smaller.