Vision-and-Language Pretraining (VLP) has improved performance on various joint vision-and-language downstream tasks. Current approaches to VLP heavily rely on image feature extraction processes, most of which involve region supervision (e.g., object detection) and convolutional architectures (e.g., ResNet). Although disregarded in the literature, we find this reliance problematic in terms of both (1) efficiency/speed, in that simply extracting input features requires much more computation than the actual multimodal interaction steps; and (2) expressive power, as it is upper-bounded by the expressive power of the visual encoder and its predefined visual vocabulary. In this paper, we present a minimal VLP model, Vision-and-Language Transformer (ViLT), monolithic in the sense that the processing of visual inputs is drastically simplified to the same convolution-free manner in which we process textual inputs. We show that ViLT is up to 60 times faster than previous VLP models, yet achieves competitive or better downstream task performance.
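To make the "convolution-free" input pipeline concrete, the following is a minimal sketch, not the authors' released implementation: text tokens are embedded via a lookup table, image patches via a single linear projection of flattened pixels, and both receive position and modality-type embeddings before being concatenated into one sequence for a shared transformer encoder. All module names, dimensions (e.g., 768-d hidden size, 32-pixel patches, 384-pixel images), and hyperparameters here are illustrative assumptions.

```python
# Minimal sketch of ViLT-style convolution-free input processing (assumed details):
# both modalities are embedded by cheap lookups/linear projections and fused by
# one shared transformer, with no object detector or CNN backbone in the loop.
import torch
import torch.nn as nn

class MinimalViLTInput(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, patch=32, img=384, max_text=40):
        super().__init__()
        num_patches = (img // patch) ** 2
        self.word_emb = nn.Embedding(vocab_size, hidden)        # text tokens -> vectors
        self.patch_proj = nn.Linear(3 * patch * patch, hidden)  # flattened RGB patches -> vectors
        self.text_pos = nn.Parameter(torch.zeros(1, max_text, hidden))
        self.img_pos = nn.Parameter(torch.zeros(1, num_patches, hidden))
        self.type_emb = nn.Embedding(2, hidden)                 # modality type: 0 = text, 1 = image
        self.patch = patch

    def forward(self, token_ids, images):
        # token_ids: (B, L) integer ids; images: (B, 3, H, W), H and W divisible by the patch size
        t = self.word_emb(token_ids) + self.text_pos[:, :token_ids.size(1)]
        t = t + self.type_emb(torch.zeros_like(token_ids))
        # cut the image into non-overlapping patches and flatten them; a single linear
        # projection replaces the region-feature / convolutional visual backbone
        p = images.unfold(2, self.patch, self.patch).unfold(3, self.patch, self.patch)
        p = p.permute(0, 2, 3, 1, 4, 5).flatten(1, 2).flatten(2)  # (B, num_patches, 3*patch*patch)
        v = self.patch_proj(p) + self.img_pos
        v = v + self.type_emb(torch.ones(v.shape[:2], dtype=torch.long, device=v.device))
        return torch.cat([t, v], dim=1)  # joint sequence for the shared transformer encoder

# Example usage (hypothetical wiring):
#   encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(768, 12, batch_first=True), 12)
#   fused = encoder(MinimalViLTInput()(token_ids, images))
```

Under these assumptions, the per-image cost of the input stage is a single linear projection over a few hundred patches, which is why the feature-extraction step becomes negligible relative to the transformer's multimodal interaction layers.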