We propose the Vision-and-Augmented-Language Transformer (VAuLT). VAuLT is an extension of the popular Vision-and-Language Transformer (ViLT) that improves performance on vision-and-language tasks involving more complex text inputs than image captions, while having minimal impact on training and inference efficiency. ViLT enables efficient training and inference on vision-and-language tasks by using a shallow image encoder. However, it is pretrained on captioning and similar datasets, where the language input is simple, literal, and descriptive, and therefore lacks linguistic diversity. When working with multimedia data in the wild, such as multimodal social media data (in our work, from Twitter), there is a notable shift away from captioning-style language, as well as greater task diversity, and we indeed find evidence that the language capacity of ViLT is the bottleneck. The key insight of VAuLT is to propagate the output representations of a large language model such as BERT to the language input of ViLT. We show that this strategy significantly improves over ViLT on vision-and-language tasks involving richer language inputs and affective constructs, such as TWITTER-2015, TWITTER-2017, MVSA-Single, and MVSA-Multiple, but lags behind on purer reasoning tasks such as the Bloomberg Twitter Text-Image Relationship dataset. We have released the code for all our experiments at https://github.com/gchochla/VAuLT.
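The wiring described above — replacing ViLT's token-embedding lookup with BERT's contextual output representations before the joint encoder — can be sketched as follows. This is a minimal shape-level illustration using NumPy stand-ins: the stub functions, dimensions, and modality-type embeddings are assumptions for exposition, not the released implementation (both BERT-base and ViLT happen to share a 768-dimensional hidden size, which is what makes this propagation straightforward).

```python
import numpy as np

# Illustrative shapes (assumed): batch, text length, image patches, hidden dim.
# 768 matches both BERT-base and ViLT hidden sizes.
B, L, N, D = 2, 16, 145, 768

rng = np.random.default_rng(0)

def bert_encoder(token_ids):
    # Stand-in for BERT: one contextual embedding per input token.
    return rng.standard_normal((token_ids.shape[0], token_ids.shape[1], D))

def patch_embed(images):
    # Stand-in for ViLT's shallow linear patch projection (its "image encoder").
    return rng.standard_normal((images.shape[0], N, D))

token_ids = rng.integers(0, 30522, size=(B, L))
images = rng.standard_normal((B, 3, 384, 384))

# VAuLT's key change: instead of ViLT's own (shallow) token-embedding lookup,
# feed BERT's output representations as the language input.
text_embeds = bert_encoder(token_ids)   # (B, L, D)
image_embeds = patch_embed(images)      # (B, N, D)

# Add per-modality type embeddings and concatenate, as in ViLT,
# before the joint transformer encoder processes both modalities together.
text_type = rng.standard_normal((D,))
image_type = rng.standard_normal((D,))
joint_input = np.concatenate(
    [text_embeds + text_type, image_embeds + image_type], axis=1
)  # (B, L + N, D)
```

Because the language input to the joint encoder keeps the same shape and dimensionality as in ViLT, this substitution leaves the downstream architecture untouched, which is why the efficiency impact is minimal.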