One of the major challenges of machine translation (MT) is ambiguity, which can in some cases be resolved by accompanying context such as an image. However, recent work in multimodal MT (MMT) has shown that obtaining improvements from images is challenging, limited not only by the difficulty of building effective cross-modal representations but also by the lack of specific evaluation and training data. We present a new MMT approach based on a strong text-only MT model, which uses neural adapters and a novel guided self-attention mechanism and which is jointly trained on both visual masking and MMT. We also release CoMMuTE, a Contrastive Multilingual Multimodal Translation Evaluation dataset, composed of ambiguous sentences and their possible translations, accompanied by disambiguating images corresponding to each translation. Our approach obtains competitive results compared to strong text-only models on standard English-to-French benchmarks and outperforms these baselines and state-of-the-art MMT systems by a large margin on our contrastive test set.
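To make the "guided self-attention" idea concrete, the sketch below illustrates one way an attention mask can restrict how text tokens and image-region tokens interact in a joint sequence. This is a minimal, hypothetical illustration, not the paper's implementation: the alignment matrix `align`, the function names, and the single-head attention are illustrative assumptions, and the real model would plug such a mask into a full transformer layer with adapters.

```python
# Hypothetical sketch: a "guided" attention mask over T text tokens followed by
# R visual-region tokens. `align` (T x R, boolean) is assumed to come from an
# external text-region grounding step; all names/shapes are illustrative.
import torch
import torch.nn.functional as F

def guided_attention_mask(num_text: int, num_regions: int,
                          align: torch.Tensor) -> torch.Tensor:
    """Return a (T+R) x (T+R) boolean mask; True means attention is allowed."""
    total = num_text + num_regions
    mask = torch.zeros(total, total, dtype=torch.bool)
    # Text tokens attend to every text token (standard self-attention).
    mask[:num_text, :num_text] = True
    # Text tokens may additionally attend to the image regions they align with.
    mask[:num_text, num_text:] = align
    # Region tokens attend to their aligned text tokens and to themselves.
    mask[num_text:, :num_text] = align.t()
    mask[num_text:, num_text:] = torch.eye(num_regions, dtype=torch.bool)
    return mask

def masked_self_attention(x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Single-head scaled dot-product attention with the guided mask applied."""
    d = x.size(-1)
    scores = x @ x.transpose(-2, -1) / d ** 0.5        # (T+R, T+R) similarity
    scores = scores.masked_fill(~mask, float("-inf"))  # block disallowed pairs
    return F.softmax(scores, dim=-1) @ x               # (T+R, d) mixed features

# Toy usage: 4 text tokens, 2 regions; word 1 aligned to region 0, word 3 to region 1.
align = torch.zeros(4, 2, dtype=torch.bool)
align[1, 0] = True
align[3, 1] = True
x = torch.randn(6, 16)  # concatenated text + region embeddings
out = masked_self_attention(x, guided_attention_mask(4, 2, align))
```

Under these assumptions, ambiguous words receive visual evidence only from the regions they are grounded to, which is the intuition behind guiding the attention rather than letting every token attend to the whole image.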