Existing multilingual machine translation approaches mainly focus on English-centric directions, while the non-English directions still lag behind. In this work, we aim to build a many-to-many translation system with an emphasis on the quality of non-English language directions. Our intuition is based on the hypothesis that a universal cross-language representation leads to better multilingual translation performance. To this end, we propose mCOLT, a training method that yields a single unified multilingual translation model. mCOLT is empowered by two techniques: (i) a contrastive learning scheme to close the gap among representations of different languages, and (ii) data augmentation on both parallel and monolingual data to further align token representations. For English-centric directions, mCOLT achieves competitive or even better performance than the strong pre-trained model mBART on tens of WMT benchmarks. For non-English directions, mCOLT achieves an average improvement of 10+ BLEU over the multilingual baseline.
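As a concrete illustration of the contrastive learning scheme, one standard sentence-level instantiation (a sketch only; the pooling function $\mathcal{R}(\cdot)$, the similarity $\mathrm{sim}(\cdot,\cdot)$, and the temperature $\tau$ are assumptions here, not necessarily the exact formulation used in this work) pulls the encoder representations of a parallel pair $(x, y)$ together while pushing apart representations of non-parallel sentences $y'$ drawn from the same batch:
\[
\mathcal{L}_{\mathrm{ctr}} = -\log \frac{\exp\big(\mathrm{sim}(\mathcal{R}(x), \mathcal{R}(y)) / \tau\big)}{\sum_{y' \in \mathcal{B}} \exp\big(\mathrm{sim}(\mathcal{R}(x), \mathcal{R}(y')) / \tau\big)},
\]
where $\mathcal{R}(\cdot)$ pools the encoder outputs into a single sentence vector and $\mathcal{B}$ denotes the in-batch negatives. Such a term would be added to the standard cross-entropy translation loss, encouraging semantically equivalent sentences in different languages to map to nearby points in a shared representation space.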