Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, which use a relatively small set of manually annotated data (compared to web-crawled data) to perceive the visual world. However, it has been observed that large-scale pre-training usually results in better generalization performance; for example, CLIP (Contrastive Language-Image Pre-training), trained on a massive amount of image-caption pairs, has shown strong zero-shot capability on various vision tasks. To further study the advantage brought by CLIP, we propose to use CLIP as the visual encoder in various V&L models in two typical scenarios: 1) plugging CLIP into task-specific fine-tuning; 2) combining CLIP with V&L pre-training and transferring to downstream tasks. We show that CLIP significantly outperforms widely used visual encoders trained with in-domain annotated data, such as BottomUp-TopDown. We achieve competitive or better results on diverse V&L tasks, while establishing new state-of-the-art results on Visual Question Answering, Visual Entailment, and Vision-and-Language Navigation. We release our code at https://github.com/clip-vil/CLIP-ViL.
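To make the first scenario concrete, the following is a minimal sketch (not the authors' exact pipeline) of extracting visual features with the open-source `clip` package and feeding them to a task-specific head for fine-tuning; the image path, the answer-vocabulary size, and the linear `answer_head` are illustrative placeholders, not parts of the released codebase.

```python
# Sketch: swap a detector-based encoder (e.g., BottomUp-TopDown) for a
# pre-trained CLIP visual encoder when fine-tuning a V&L task head.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # pre-trained CLIP encoder

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    visual_feats = model.encode_image(image)  # (1, 512) pooled image embedding

# Hypothetical task-specific head (e.g., VQA answer classifier); in practice the
# CLIP features would be fused with question features before classification.
num_answers = 3129  # illustrative answer-vocabulary size
answer_head = torch.nn.Linear(visual_feats.shape[-1], num_answers).to(device)
logits = answer_head(visual_feats.float())
```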