This paper describes the Tencent AI Lab - Shanghai Jiao Tong University (TAL-SJTU) low-resource translation systems for the WMT22 shared task. We participate in the general translation task on English$\Leftrightarrow$Livonian. Our system is based on M2M100, with novel techniques that adapt it to the target language pair. (1) Cross-model word embedding alignment: inspired by cross-lingual word embedding alignment, we successfully transfer a pre-trained word embedding to M2M100, enabling it to support Livonian. (2) Gradual adaptation strategy: we exploit Estonian and Latvian as auxiliary languages for many-to-many translation training and then adapt the model to English-Livonian. (3) Data augmentation: to enlarge the parallel data for English-Livonian, we construct pseudo-parallel data with Estonian and Latvian as pivot languages. (4) Fine-tuning: to make the most of all available data, we fine-tune the model on the validation set and with online back-translation, which further boosts performance. In model evaluation: (1) We find that previous work underestimated the translation performance for Livonian due to inconsistent Unicode normalization, which may cause a discrepancy of up to 14.9 BLEU points. (2) In addition to the standard validation set, we also employ round-trip BLEU to evaluate the models, which we find more appropriate for this task. Finally, our unconstrained system achieves BLEU scores of 17.0 and 30.4 for English to/from Livonian.
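The effect of inconsistent Unicode normalization on string-matching metrics can be illustrated with a minimal sketch. It is not the evaluation pipeline of this work: the helper names `normalize` and `token_match_rate`, the choice of NFC as the target form, and the sample string are all assumptions made for illustration, and the toy token-match rate stands in for BLEU only to show the mechanism. Characters with diacritics can be encoded either as one precomposed code point or as a base letter plus a combining mark, so two visually identical strings may fail to match.

```python
import unicodedata


def normalize(text: str, form: str = "NFC") -> str:
    """Map text to a single Unicode normalization form (NFC is an assumed choice)."""
    return unicodedata.normalize(form, text)


def token_match_rate(hypothesis: str, reference: str) -> float:
    """Toy stand-in for an n-gram metric: fraction of position-wise equal tokens."""
    hyp_tokens = hypothesis.split()
    ref_tokens = reference.split()
    matches = sum(h == r for h, r in zip(hyp_tokens, ref_tokens))
    return matches / max(len(ref_tokens), 1)


# Illustrative sample: the first token contains precomposed letters with diacritics.
reference = "t\u0113ri\u0146t\u0161 abc"
# The same text in decomposed form (base letters followed by combining marks).
hypothesis = unicodedata.normalize("NFD", reference)

# Without normalization, the visually identical first token does not match.
raw_rate = token_match_rate(hypothesis, reference)

# After mapping both sides to the same form, every token matches.
normalized_rate = token_match_rate(normalize(hypothesis), normalize(reference))

print(f"raw: {raw_rate:.2f}, normalized: {normalized_rate:.2f}")
```

Applying the same normalization form to both system output and reference before scoring removes this source of spurious mismatches; which form is appropriate depends on how the reference data were encoded.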