Cross-modal encoders for vision-language (VL) tasks are often pretrained with carefully curated vision-language datasets. While these datasets reach the order of 10 million samples, the labor cost is prohibitive to scale further. Conversely, unimodal encoders are pretrained with simpler, less cost-prohibitive annotations, achieving scales of hundreds of millions to billions of samples. As a result, unimodal encoders have achieved state-of-the-art (SOTA) results on many downstream tasks. However, challenges remain when applying them to VL tasks: the pretraining data is not optimal for cross-modal architectures and requires heavy computational resources, and unimodal architectures lack the cross-modal interactions that have demonstrated significant benefits for VL tasks. Therefore, how best to leverage pretrained unimodal encoders for VL tasks remains an area of active research. In this work, we propose a method for leveraging unimodal vision and text encoders in VL tasks that augments existing VL approaches while conserving computational complexity. Specifically, we propose Multimodal Adaptive Distillation (MAD), which adaptively distills useful knowledge from pretrained unimodal encoders into cross-modal VL encoders. In addition, to better capture nuanced impacts on VL task performance, we introduce an evaluation protocol that covers Visual Commonsense Reasoning (VCR), Visual Entailment (SNLI-VE), and Visual Question Answering (VQA) across a variety of data constraints and conditions of domain shift. Experiments demonstrate that MAD yields consistent gains in the low-shot, domain-shifted, and fully supervised settings on VCR, SNLI-VE, and VQA, achieving SOTA performance on VCR compared to other single models pretrained with image-text data. Finally, MAD outperforms concurrent works that utilize the pretrained vision encoder from CLIP. Code will be made available.
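The abstract does not spell out MAD's distillation objective. As a hedged illustration only (not the paper's exact formulation; the function names here are hypothetical), the standard knowledge-distillation loss that such methods typically build on can be sketched as the KL divergence between temperature-softened teacher and student distributions:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions --
    the generic knowledge-distillation objective; MAD additionally
    weights what is distilled adaptively, which is not shown here."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return sum(ti * (math.log(ti + 1e-12) - math.log(si + 1e-12))
               for ti, si in zip(t, s))
```

The loss is zero when student and teacher distributions match and grows as they diverge; the temperature controls how much of the teacher's "dark knowledge" (relative probabilities of non-target classes) is exposed to the student.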