深视语言理解双电码模型 (Distilled Dual-Encoder Model for Vision-Language Understanding)

We propose a cross-modal attention distillation framework to train a dual-encoder model for vision-language understanding tasks, such as visual reasoning and visual question answering. Dual-encoder models have a faster inference speed than fusion-encoder models and enable the pre-computation of images and text during inference. However, the shallow interaction module used in dual-encoder models is insufficient to handle complex vision-language understanding tasks. In order to learn deep interactions of images and text, we introduce cross-modal attention distillation, which uses the image-to-text and text-to-image attention distributions of a fusion-encoder model to guide the training of our dual-encoder model. In addition, we show that applying the cross-modal attention distillation for both pre-training and fine-tuning stages achieves further improvements. Experimental results demonstrate that the distilled dual-encoder model achieves competitive performance for visual reasoning, visual entailment and visual question answering tasks while enjoying a much faster inference speed than fusion-encoder models. Our code and models will be publicly available at https://github.com/kugwzk/Distilled-DualEncoder.

翻译：我们提议了一个跨模式关注蒸馏框架,用于为视觉推理和视觉问题解答等视觉语言理解任务培训双向读取模型。双向读取模型比聚合-编码模型具有更快的推导速度,并且能够在推算过程中对图像和文字进行预估。但是,在双向编码模型中使用的浅度互动模块不足以处理复杂的视觉语言理解任务。为了学习图像和文字的深层互动,我们引入了跨模式关注蒸馏,利用一个聚合-编码模型的图像到文字和文字到模拟关注分布来指导我们双重编码模型的培训。此外,我们表明,在培训前和微调阶段应用跨模式的跨模式蒸馏取得了进一步的改进。实验结果显示,蒸馏的双层电解模式在视觉推理、视觉要求和视觉问题解析任务方面达到竞争性的性能,同时在比聚合-编码/Dquorder模型更快得多的推导速度。我们的代码和模型将公开提供。

相关内容

MoDELS

关注 43

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【DeepMind】多模态预训练模型概述，37页ppt

专知会员服务

95+阅读 · 2021年7月2日

最新《Transformers模型》教程，64页ppt

专知会员服务

321+阅读 · 2020年11月26日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

【AAAI2020】用于视觉对话中深度视觉理解的自适应双向编码模型（DualVD: An Adaptive Dual Encoding Model for Deep Visual Understanding in Visual Dialogue）, 中科院信工所于静等

专知会员服务

29+阅读 · 2019年11月23日