Vision-Language Transformers can be learned without human labels (e.g., class labels or bounding boxes). Existing work, whether it explicitly uses bounding boxes or patches, assumes that the visual backbone must first be trained on ImageNet class prediction before being integrated into a multimodal linguistic pipeline. We show that this is not necessary and introduce a new model, Vision-Language from Captions (VLC), built on top of Masked Autoencoders, that does not require this supervision. In fact, in a head-to-head comparison between ViLT, the current state-of-the-art patch-based vision-language transformer, which is pretrained with supervised object classification, and our model, VLC, we find that our approach (1) outperforms ViLT on standard benchmarks, (2) provides more interpretable and intuitive patch visualizations, and (3) is competitive with many larger models that rely on ROIs from detectors trained on annotated bounding boxes.
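To make the architectural claim concrete, the following is a minimal, hypothetical sketch (not the authors' released code) of a patch-based vision-language transformer in the spirit of VLC: image patches and text tokens are embedded into a single sequence and processed by one joint transformer encoder, and the visual weights would be initialized from a self-supervised MAE checkpoint rather than from a supervised ImageNet classifier. All module names, dimensions, and the checkpoint path are illustrative assumptions.

```python
# Hypothetical sketch of a patch-based vision-language transformer
# initialized from MAE weights instead of a supervised ImageNet backbone.
import torch
import torch.nn as nn


class TinyVLC(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=768, vocab=30522, layers=12):
        super().__init__()
        num_patches = (img_size // patch) ** 2
        # Patch embedding: linear projection of non-overlapping image patches.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.text_embed = nn.Embedding(vocab, dim)
        # Modality-type embeddings and learned image position embeddings
        # (text position embeddings omitted for brevity).
        self.type_embed = nn.Embedding(2, dim)
        self.pos_img = nn.Parameter(torch.zeros(1, num_patches, dim))
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, image, token_ids):
        # Image patches -> (B, num_patches, dim); text tokens -> (B, T, dim).
        v = self.patch_embed(image).flatten(2).transpose(1, 2) + self.pos_img
        t = self.text_embed(token_ids)
        v = v + self.type_embed.weight[0]
        t = t + self.type_embed.weight[1]
        # Single joint transformer over the concatenated multimodal sequence.
        return self.encoder(torch.cat([t, v], dim=1))


model = TinyVLC()
# In the setting described above, the visual weights would be loaded from a
# self-supervised MAE checkpoint (path is a placeholder), not from a model
# pretrained on ImageNet class prediction, e.g.:
# model.load_state_dict(torch.load("mae_pretrain.pth"), strict=False)
out = model(torch.randn(2, 3, 224, 224), torch.randint(0, 30522, (2, 16)))
print(out.shape)  # (2, 16 + 196, 768): text tokens followed by image patches
```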