Past works on multimodal machine translation (MMT) elevate the bilingual setup by incorporating additional aligned vision information. However, the image-must requirement of the multimodal dataset largely hinders MMT's development: it demands aligned [image, source text, target text] triplets. This limitation is especially troublesome during the inference phase, when no aligned image is provided, as in the standard NMT setup. Thus, in this work, we introduce IKD-MMT, a novel MMT framework that supports an image-free inference phase via an inversion knowledge distillation scheme. In particular, a multimodal feature generator is trained with a knowledge distillation module to directly generate the multimodal feature from (only) the source text as input. While a few prior works have explored the possibility of supporting image-free inference for machine translation, their performance has yet to rival that of image-must translation. In our experiments, we identify our method as the first image-free approach to comprehensively rival or even surpass (almost) all image-must frameworks, and it achieves the state-of-the-art result on the widely used Multi30k benchmark. Our code and data are available at: https://github.com/pengr/IKD-mmt/tree/master.
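To make the distillation scheme concrete, below is a minimal PyTorch sketch of the core idea as described above: a text-only "multimodal feature generator" (student) is trained to mimic image features from a vision encoder (teacher), so no image is needed at inference. All names, dimensions, and the MSE distillation loss are illustrative assumptions, not the paper's exact architecture.

```python
# Illustrative sketch of inversion knowledge distillation for image-free MMT.
# Assumptions (not from the paper): module names, feature sizes, MSE as the
# distillation objective, and mean-pooling over the source-text encoding.
import torch
import torch.nn as nn

class MultimodalFeatureGenerator(nn.Module):
    """Student: predicts a multimodal (image-like) feature from source text only."""
    def __init__(self, d_text=512, d_visual=2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_text, d_visual),
            nn.ReLU(),
            nn.Linear(d_visual, d_visual),
        )

    def forward(self, text_hidden):          # text_hidden: (batch, seq, d_text)
        pooled = text_hidden.mean(dim=1)     # pool the source-text encoding
        return self.proj(pooled)             # (batch, d_visual)

def training_loss(generator, text_hidden, image_feat, translation_loss, alpha=1.0):
    # Distill the teacher's image feature into the generator, jointly with the
    # usual translation loss; `alpha` (hypothetical) balances the two terms.
    pred_feat = generator(text_hidden)
    kd_loss = nn.functional.mse_loss(pred_feat, image_feat)  # inversion KD term
    return translation_loss + alpha * kd_loss

# At inference only the generator runs, so no aligned image is required:
#   visual_feat = generator(text_hidden)
#   output = decoder(text_hidden, visual_feat)
```

The design point this sketch highlights is that the teacher (vision encoder) appears only in the training loss; once training finishes, the student generator replaces the image pathway entirely, restoring the text-only interface of standard NMT.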