Multi-modal machine translation (MMT) improves translation quality by incorporating visual information. However, existing MMT models overlook the fact that an image may carry information irrelevant to the text, introducing noise into the model and degrading translation quality. In this paper, we propose a novel Gumbel-Attention mechanism for multi-modal machine translation that selects the text-related parts of the image features. Specifically, unlike previous attention-based methods, we first use a differentiable method to select image information and automatically discard the useless parts of the image features. An image-aware text representation is then generated from the Gumbel-Attention score matrix and the image features. Next, we independently encode the text representation and the image-aware text representation with the multi-modal encoder. Finally, the encoder's final output is obtained through multi-modal gated fusion. Experiments and case analysis show that our method retains the image features related to the text, and that these retained features help the MMT model generate better translations.
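The selection-then-fusion pipeline described above can be sketched in a few lines. The abstract gives no exact formulas, so everything below is an illustrative assumption rather than the paper's actual implementation: scaled dot-product scores between text tokens and image regions, a Gumbel-noise sigmoid gate that (near-)differentiably keeps or drops each region per token, and a single-vector gating network for the final fusion.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_gate(logits, tau=0.5):
    # Gumbel trick (assumed form): add Gumbel(0,1) noise to the scores and
    # squash with a tempered sigmoid, giving a near-binary, differentiable
    # keep/drop decision for each (token, region) pair.
    g = -np.log(-np.log(rng.uniform(size=logits.shape) + 1e-9) + 1e-9)
    return 1.0 / (1.0 + np.exp(-(logits + g) / tau))

def gumbel_attention(text, image):
    # text: (T, d) token states; image: (R, d) region features.
    scores = text @ image.T / np.sqrt(text.shape[-1])   # (T, R) score matrix
    gate = gumbel_gate(scores)                          # select text-related regions
    weights = np.exp(scores - scores.max(-1, keepdims=True)) * gate
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ image                              # image-aware text representation

def gated_fusion(h_text, h_img, w):
    # lam = sigmoid(w . [h_text; h_img]) per token (w is a hypothetical
    # learned vector); blend the two encoder outputs with it.
    lam = 1.0 / (1.0 + np.exp(-np.concatenate([h_text, h_img], -1) @ w))
    return lam[:, None] * h_text + (1.0 - lam)[:, None] * h_img

T, R, d = 5, 8, 16
text = rng.normal(size=(T, d))
image = rng.normal(size=(R, d))
h_img = gumbel_attention(text, image)        # (5, 16)
fused = gated_fusion(text, h_img, rng.normal(size=2 * d))
print(fused.shape)                           # (5, 16)
```

In practice both `text` and `h_img` would come from separate Transformer encoder stacks, and the Gumbel gate would use a straight-through estimator during training; this sketch only shows the data flow.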