Multimodal machine translation (MMT) aims to improve translation quality by incorporating information from other modalities, such as vision. Previous MMT systems focus mainly on better access and use of visual information, and tend to validate their methods on image-related datasets. These studies face two challenges. First, they can only utilize triple data (bilingual texts paired with images), which is scarce; second, current benchmarks are relatively restricted and do not correspond to realistic scenarios. Therefore, this paper establishes both new methods and a new dataset for MMT. First, we propose 2/3-Triplet, a framework with two new approaches that enhance MMT by utilizing large-scale non-triple data: monolingual image-text data and parallel text-only data. Second, we construct an English-Chinese e-commercial multimodal translation dataset (including training and test sets), named EMMT, whose test set is carefully selected so that some words are ambiguous and would be mistranslated without the help of images. Experiments show that our method is better suited to real-world scenarios and can significantly improve translation performance by using more non-triple data. In addition, our model also rivals various SOTA models on conventional multimodal translation benchmarks.
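The abstract names the three data types 2/3-Triplet draws on but gives no implementation details, so the following is only a minimal sketch of the general idea of training one translation model on mixed triple and non-triple batches. Everything here is a hypothetical assumption, not the paper's actual architecture: the ToyMMT class, the 512-dimensional image features, and the premise that monolingual image-text pairs have already been converted into pseudo-parallel pairs (e.g., by machine-translating the caption).

```python
# Minimal sketch (NOT the authors' released code) of a single model that
# consumes all three data sources named in the abstract: triples
# (src, img, tgt), text-only parallel pairs, and image-text pairs that
# have been turned into pseudo-parallel triples. All names hypothetical.
import torch
import torch.nn as nn

class ToyMMT(nn.Module):
    """Tiny stand-in for an MMT model: fuses optional image features
    with a crude sentence encoding, then scores target tokens."""
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.img_proj = nn.Linear(512, dim)  # assumed 512-d image features
        self.out = nn.Linear(dim, vocab)

    def forward(self, src_ids, img_feat=None):
        h = self.embed(src_ids).mean(dim=1)  # mean-pooled source encoding
        if img_feat is not None:             # image branch is optional, so
            h = h + self.img_proj(img_feat)  # text-only pairs still train
        return self.out(h)

def loss_on_batch(model, batch, ce=nn.CrossEntropyLoss()):
    # Triple data supplies "img"; parallel text-only data omits it and
    # uses the same translation objective through the shared parameters.
    logits = model(batch["src"], batch.get("img"))
    return ce(logits, batch["tgt"])

# Usage: one optimizer step over a triple batch plus a text-only batch.
model = ToyMMT()
triple = {"src": torch.randint(0, 1000, (8, 12)),
          "img": torch.randn(8, 512),
          "tgt": torch.randint(0, 1000, (8,))}
text_only = {"src": torch.randint(0, 1000, (8, 12)),
             "tgt": torch.randint(0, 1000, (8,))}
loss = loss_on_batch(model, triple) + loss_on_batch(model, text_only)
loss.backward()
```

The design point the sketch tries to convey is simply that making the image branch optional lets scarce triples and abundant non-triple data update the same translation parameters; how the paper actually fuses or substitutes visual features is not specified in the abstract.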