Multi-modal machine translation (MMT) improves translation quality by introducing visual information. However, existing MMT models ignore the problem that an image can carry information irrelevant to the text, introducing noise into the model and degrading translation quality. This paper proposes a novel Gumbel-Attention mechanism for multi-modal machine translation, which selects the text-related parts of the image features. Specifically, unlike previous attention-based methods, we first use a differentiable method to select image information and automatically discard the useless parts of the image features. Experiments show that our method retains the image features related to the text, and that the remaining parts help the MMT model generate better translations.
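The differentiable selection described above can be illustrated with the Gumbel-Softmax trick: relevance logits between text tokens and image regions are turned into a (nearly) discrete keep/drop mask that still admits gradients in an autograd framework. The sketch below is a minimal NumPy illustration under our own assumptions (binary keep/drop logits per token-region pair, dot-product relevance scores); it is not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=1.0):
    """Sample from a categorical distribution via the Gumbel-Softmax trick.
    (Differentiable in autograd frameworks; here we only show the forward pass.)"""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1) noise
    y = (logits + g) / tau
    y = y - y.max(axis=-1, keepdims=True)                 # numerical stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical shapes: 4 text tokens attend over 6 image regions.
text = rng.standard_normal((4, 8))     # (tokens, dim)
image = rng.standard_normal((6, 8))    # (regions, dim)

# Score a binary "keep vs. drop" choice per (token, region) pair
# (a simplification of the paper's selection; the scoring here is ours).
scores = text @ image.T                               # (4, 6) relevance logits
keep_logits = np.stack([scores, -scores], axis=-1)    # (4, 6, 2): keep / drop
keep = gumbel_softmax(keep_logits, tau=0.5)[..., 0]   # soft keep-mask in (0, 1)

# Masked attention: irrelevant regions are (softly) removed before pooling.
weights = keep * np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)
context = weights @ image                             # (4, 8) text-aware image context
```

With a low temperature `tau`, the mask approaches a hard 0/1 selection while remaining differentiable, which is what lets the model learn to remove text-irrelevant image regions end to end.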