Multimodal machine translation (MMT) aims to improve translation quality by pairing the source sentence with its corresponding image. Despite promising performance, MMT models still suffer from the problem of input degradation: models focus more on textual information while visual information is largely overlooked. In this paper, we endeavor to improve MMT performance by increasing visual awareness from an information-theoretic perspective. Specifically, we decompose the informative visual signals into two parts: source-specific information and target-specific information. We quantify each with mutual information and propose two objective-optimization methods to better leverage visual signals. Experiments on two datasets demonstrate that our approach effectively enhances the visual awareness of MMT models and achieves superior results against strong baselines.