We show that Vision-Language Transformers can be learned without human labels (e.g., class labels, bounding boxes, etc.). Existing work, whether explicitly utilizing bounding boxes or patches, assumes that the visual backbone must first be trained on ImageNet class prediction before being integrated into a multimodal linguistic pipeline. We show that this is not necessary and introduce a new model, Vision-Language from Captions (VLC), built on top of Masked Auto-Encoders that does not require this supervision. In fact, in a head-to-head comparison between ViLT, the current state-of-the-art patch-based vision-language transformer, which is pretrained with supervised object classification, and our model, VLC, we find that our approach (1) outperforms ViLT on standard benchmarks, (2) provides more interpretable and intuitive patch visualizations, and (3) is competitive with many larger models that utilize ROIs trained on annotated bounding-boxes.