CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks

From arXiv: This paper has been substantially modified and updated for resubmission to another conference. The new paper is titled "Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks", https://doi.org/10.48550/arXiv.2204.10496

Contrastive language-image pretraining (CLIP) links the vision and language modalities in a unified embedding space, yielding tremendous potential for vision-language (VL) tasks. While early concurrent works have begun to study this potential on a subset of tasks, important questions remain: 1) What is the benefit of CLIP on unstudied VL tasks? 2) Does CLIP provide benefit in low-shot or domain-shifted scenarios? 3) Can CLIP improve existing approaches without impacting inference or pretraining complexity? In this work, we seek to answer these questions through two key contributions. First, we introduce an evaluation protocol that includes Visual Commonsense Reasoning (VCR), Visual Entailment (SNLI-VE), and Visual Question Answering (VQA), across a variety of data availability constraints and conditions of domain shift. Second, we propose an approach, named CLIP Targeted Distillation (CLIP-TD), to intelligently distill knowledge from CLIP into existing architectures using a dynamically weighted objective applied to adaptively selected tokens per instance. Experiments demonstrate that CLIP-TD leads to exceptional gains in the low-shot (up to 51.9%) and domain-shifted (up to 71.3%) conditions of VCR, while simultaneously improving performance under standard fully supervised conditions (up to 2%), achieving state-of-the-art performance on VCR compared to other single models that are pretrained with image-text data only. On SNLI-VE, CLIP-TD produces significant gains in low-shot conditions (up to 6.6%) as well as fully supervised conditions (up to 3%). On VQA, CLIP-TD improves performance in low-shot (up to 9%) and fully supervised (up to 1.3%) conditions. Finally, CLIP-TD outperforms concurrent works that use CLIP for finetuning, as well as naive distillation baselines. Code will be made available.
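The abstract describes the objective only at a high level: a dynamically weighted distillation term applied to adaptively selected tokens per instance. The sketch below is one possible reading of that sentence, not the authors' implementation. The relevance scores used to pick tokens, the per-instance `confidence` weight, the L1 distance, and all function names are assumptions introduced here for illustration; the abstract does not specify any of them.

```python
from typing import List, Sequence

Vector = Sequence[float]


def l1_distance(a: Vector, b: Vector) -> float:
    """Mean absolute difference between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_tokens(relevance: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring tokens (hypothetical selection rule)."""
    k = max(0, min(k, len(relevance)))
    order = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    return order[:k]


def targeted_distillation_loss(
    student_tokens: Sequence[Vector],
    teacher_tokens: Sequence[Vector],
    relevance: Sequence[float],
    k: int,
    confidence: float,
) -> float:
    """Distill only the selected tokens and scale by a per-instance weight.

    Assumptions (not taken from the abstract): `relevance` scores each token,
    `confidence` is the dynamic weight for this instance, and the
    student/teacher mismatch is measured with an L1 distance.
    """
    chosen = select_tokens(relevance, k)
    if not chosen:
        return 0.0
    per_token = [l1_distance(student_tokens[i], teacher_tokens[i]) for i in chosen]
    return confidence * sum(per_token) / len(per_token)


def total_loss(task_loss: float, distill_loss: float, alpha: float = 1.0) -> float:
    """Combine the task loss with the distillation term (alpha is a hypothetical scale)."""
    return task_loss + alpha * distill_loss


# Toy instance: three tokens with 2-d features.
student = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
teacher = [[1.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
relevance = [0.9, 0.1, 0.5]

loss = targeted_distillation_loss(student, teacher, relevance, k=2, confidence=0.5)
print(loss)  # tokens 0 and 2 are selected; 0.5 * mean(1.0, 2.0) = 0.75
```

Because the distillation term only changes the training loss, a student trained this way keeps its original architecture at inference time, which is consistent with the abstract's goal of improving existing approaches without adding inference complexity.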
