We propose a cross-modal attention distillation framework to train a dual-encoder model for vision-language understanding tasks, such as visual reasoning and visual question answering. Dual-encoder models have a faster inference speed than fusion-encoder models and enable the pre-computation of image and text representations during inference. However, the shallow interaction module used in dual-encoder models is insufficient to handle complex vision-language understanding tasks. In order to learn deep interactions between images and text, we introduce cross-modal attention distillation, which uses the image-to-text and text-to-image attention distributions of a fusion-encoder model to guide the training of our dual-encoder model. In addition, we show that applying cross-modal attention distillation in both the pre-training and fine-tuning stages achieves further improvements. Experimental results demonstrate that the distilled dual-encoder model achieves competitive performance on visual reasoning, visual entailment, and visual question answering tasks while enjoying a much faster inference speed than fusion-encoder models. Our code and models will be publicly available at https://github.com/kugwzk/Distilled-DualEncoder.
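To make the core idea concrete, the sketch below shows one way the cross-modal attention distillation objective could be formed: a fusion-encoder teacher provides image-to-text and text-to-image attention distributions, the dual-encoder student computes analogous distributions from its separately encoded image and text features, and the student is trained to match the teacher via KL divergence. This is a minimal illustration under assumed tensor names and shapes, not the authors' exact implementation; real models would typically match per-head attention maps from selected layers.

```python
# Hedged sketch of a cross-modal attention distillation loss.
# All function names, shapes, and the single-layer/single-head setup are
# illustrative assumptions, not the released implementation.

import torch
import torch.nn.functional as F


def cross_attention_probs(queries: torch.Tensor, keys: torch.Tensor) -> torch.Tensor:
    """Softmax-normalized attention from queries to keys.

    queries: (batch, num_q, dim), keys: (batch, num_k, dim)
    returns: (batch, num_q, num_k)
    """
    scale = queries.size(-1) ** 0.5
    scores = torch.matmul(queries, keys.transpose(-1, -2)) / scale
    return F.softmax(scores, dim=-1)


def attention_distillation_loss(
    teacher_img2txt: torch.Tensor,  # teacher image-to-text attention (batch, num_img, num_txt)
    teacher_txt2img: torch.Tensor,  # teacher text-to-image attention (batch, num_txt, num_img)
    student_img: torch.Tensor,      # student image features (batch, num_img, dim)
    student_txt: torch.Tensor,      # student text features  (batch, num_txt, dim)
) -> torch.Tensor:
    # The student's cross-modal attention is built across its two encoders' outputs,
    # even though the encoders themselves never attend to each other.
    student_img2txt = cross_attention_probs(student_img, student_txt)
    student_txt2img = cross_attention_probs(student_txt, student_img)

    # KL(teacher || student) in both directions, averaged over the batch.
    loss_i2t = F.kl_div(student_img2txt.log(), teacher_img2txt, reduction="batchmean")
    loss_t2i = F.kl_div(student_txt2img.log(), teacher_txt2img, reduction="batchmean")
    return loss_i2t + loss_t2i


if __name__ == "__main__":
    # Toy shapes only, to show the loss runs end to end.
    batch, num_img, num_txt, dim = 2, 4, 6, 8
    t_i2t = F.softmax(torch.randn(batch, num_img, num_txt), dim=-1)
    t_t2i = F.softmax(torch.randn(batch, num_txt, num_img), dim=-1)
    s_img = torch.randn(batch, num_img, dim)
    s_txt = torch.randn(batch, num_txt, dim)
    print(attention_distillation_loss(t_i2t, t_t2i, s_img, s_txt))
```

In practice this distillation term would be added to the task or pre-training loss, in both the pre-training and fine-tuning stages as described above; the student keeps its cheap dual-encoder inference path, since the cross-attention here is only needed to compute the training signal.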