Large-scale vision and language representation learning has shown promising improvements on various vision-language tasks. Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens. Because the visual tokens and word tokens are unaligned, it is challenging for the multimodal encoder to learn image-text interactions. In this paper, we introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention, which enables more grounded vision and language representation learning. Unlike most existing methods, our method requires neither bounding box annotations nor high-resolution images. In order to improve learning from noisy web data, we propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model. We provide a theoretical analysis of ALBEF from a mutual information maximization perspective, showing that different training tasks can be interpreted as different ways to generate views for an image-text pair. ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks. On image-text retrieval, ALBEF outperforms methods that are pre-trained on orders of magnitude larger datasets. On VQA and NLVR$^2$, ALBEF achieves absolute improvements of 2.37% and 3.84% compared to the state-of-the-art, while enjoying faster inference speed. Code and pre-trained models are available at https://github.com/salesforce/ALBEF/.
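To make the two key training ideas concrete, below is a minimal PyTorch sketch of the image-text contrastive (ITC) objective combined with momentum distillation: one-hot contrastive targets are blended with soft pseudo-targets from an EMA (momentum) copy of the encoders. Function names and the `alpha`/`temp` values are illustrative, and this sketch is simplified relative to the released code, which additionally maintains queues of momentum features as extra negatives.

```python
import torch
import torch.nn.functional as F

def momentum_update(model, momentum_model, m=0.995):
    """EMA update: momentum params <- m * momentum + (1 - m) * online."""
    for p, p_m in zip(model.parameters(), momentum_model.parameters()):
        p_m.data.mul_(m).add_(p.data, alpha=1 - m)

def itc_loss_with_distillation(img_feat, txt_feat,
                               img_feat_m, txt_feat_m,
                               temp=0.07, alpha=0.4):
    """Image-text contrastive loss with momentum-distilled soft targets.

    img_feat / txt_feat:     L2-normalized features from the online encoders.
    img_feat_m / txt_feat_m: features from the momentum (EMA) encoders.
    alpha blends the momentum model's pseudo-targets with one-hot targets.
    """
    sim_i2t = img_feat @ txt_feat.t() / temp           # (B, B) similarities
    sim_t2i = txt_feat @ img_feat.t() / temp

    with torch.no_grad():                              # pseudo-targets: no grad
        sim_i2t_m = img_feat_m @ txt_feat_m.t() / temp
        sim_t2i_m = txt_feat_m @ img_feat_m.t() / temp
        onehot = torch.eye(img_feat.size(0), device=img_feat.device)
        tgt_i2t = alpha * F.softmax(sim_i2t_m, dim=1) + (1 - alpha) * onehot
        tgt_t2i = alpha * F.softmax(sim_t2i_m, dim=1) + (1 - alpha) * onehot

    loss_i2t = -(F.log_softmax(sim_i2t, dim=1) * tgt_i2t).sum(dim=1).mean()
    loss_t2i = -(F.log_softmax(sim_t2i, dim=1) * tgt_t2i).sum(dim=1).mean()
    return (loss_i2t + loss_t2i) / 2
```

With `alpha = 0`, this reduces to the standard InfoNCE-style contrastive loss; raising `alpha` lets the momentum model's soft distribution absorb noise in web-crawled image-text pairs, since a caption may plausibly describe several images in the batch.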