Vision-language (VL) pre-training has recently received considerable attention. However, most existing end-to-end pre-training approaches either only aim to tackle VL tasks such as image-text retrieval, visual question answering (VQA), and image captioning that test high-level understanding of images, or only target region-level understanding for tasks such as phrase grounding and object detection. We present FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both these types of tasks. Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model by inserting cross-attention into the image and text backbones, bringing gains in terms of memory and performance. In addition, unlike previous work that is either only pre-trained on image-text data or on fine-grained data with box-level annotations, we present a two-stage pre-training strategy that uses both these kinds of data efficiently: (i) coarse-grained pre-training based on image-text data; followed by (ii) fine-grained pre-training based on image-text-box data. We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection. Using deep multimodal fusion coupled with the two-stage pre-training, FIBER provides consistent performance improvements over strong baselines across all tasks, often outperforming methods trained on orders of magnitude more data. Code is available at https://github.com/microsoft/FIBER.
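To make the fusion-in-the-backbone idea concrete, below is a minimal PyTorch sketch, not the paper's implementation: a single backbone layer with an inserted, gated cross-attention branch that attends to the other modality, and that can be skipped entirely to recover uni-modal behavior. The module name, dimensions, head count, and zero-initialized gate are all illustrative assumptions; FIBER's actual design inserts cross-attention into full pre-trained image and text backbones (see the repository above for the real code).

```python
# A minimal sketch of fusion in the backbone, under simplified assumptions:
# one shared layer structure for both modalities. All names and sizes here
# are illustrative, not FIBER's actual configuration.
import torch
import torch.nn as nn

class FusionBackboneLayer(nn.Module):
    """One backbone layer with an inserted cross-attention branch.

    Passing other=None skips fusion, so the same layer can also run
    uni-modally on a single input stream.
    """
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Zero-initialized learnable gate (an assumption here, inspired by
        # common practice) so the original uni-modal backbone behavior is
        # preserved when fusion training begins.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor, other: torch.Tensor | None = None):
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        if other is not None:
            # Fusion in the backbone: queries from this modality,
            # keys/values from the other modality.
            h = self.norm2(x)
            x = x + self.gate * self.cross_attn(h, other, other, need_weights=False)[0]
        x = x + self.ffn(self.norm3(x))
        return x

# Usage: co-encode image patches and text tokens with mutual cross-attention.
img = torch.randn(2, 49, 256)   # (batch, patches, dim)
txt = torch.randn(2, 16, 256)   # (batch, tokens, dim)
img_layer = FusionBackboneLayer(256)
txt_layer = FusionBackboneLayer(256)
img_out = img_layer(img, other=txt)
txt_out = txt_layer(txt, other=img)
print(img_out.shape, txt_out.shape)  # (2, 49, 256) (2, 16, 256)
```

Because fusion lives inside the backbone layers rather than in a separate stack of fusion transformer layers, no extra fusion-only parameters or activations are kept around after the backbones, which is the source of the memory gains claimed in the abstract.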