How can we achieve neural machine translation with limited parallel data? Existing techniques often rely on large-scale monolingual corpora, which is impractical for some low-resource languages. In this paper, we instead connect several low-resource languages to a particular high-resource one through an additional visual modality. Specifically, we propose a cross-modal contrastive learning method that learns a shared space for all languages, introducing both a coarse-grained sentence-level objective and a fine-grained token-level one. Experimental results and further analysis show that our method can effectively learn cross-modal and cross-lingual alignment from a small amount of image-text pairs, and that it achieves significant improvements over the text-only baseline in both zero-shot and few-shot scenarios.
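The coarse-grained sentence-level objective can be pictured as an InfoNCE-style contrastive loss over paired sentence and image embeddings: matched pairs in a batch are pulled together while all other pairings serve as negatives. The sketch below is only an illustration of this idea under stated assumptions (NumPy, a hypothetical `sentence_level_contrastive_loss` function, and a fixed temperature), not the paper's actual implementation.

```python
import numpy as np

def sentence_level_contrastive_loss(text_emb, image_emb, temperature=0.1):
    """InfoNCE-style loss over a batch of paired embeddings.

    Row i of text_emb and image_emb is a positive pair; every other row
    in the batch acts as a negative. (Hypothetical sketch, not the
    paper's implementation.)
    """
    # L2-normalize each row so dot products are cosine similarities
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature  # (batch, batch); diagonal = positives

    def xent(l):
        # Cross-entropy with the diagonal entries as targets
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Symmetric loss: align text-to-image and image-to-text
    return (xent(logits) + xent(logits.T)) / 2

rng = np.random.default_rng(0)
text = rng.normal(size=(4, 16))
image = rng.normal(size=(4, 16))
print(sentence_level_contrastive_loss(text, image))
```

The fine-grained token-level objective described in the abstract would follow the same pattern, but with the contrast computed between token embeddings and image regions rather than whole sentences.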