The development of language models has shifted from encoder-decoder to decoder-only designs. In addition, conventional wisdom holds that the two most popular multimodal objectives, generative and contrastive, tend to conflict with one another, are hard to accommodate in a single architecture, and further require complex adaptations for downstream tasks. We propose a novel paradigm of training a decoder-only model for multimodal tasks, which is surprisingly effective at jointly learning these disparate vision-language objectives. This is done with a simple model, called MaMMUT. It consists of a single vision encoder and a text decoder, and accommodates contrastive and generative learning via a novel two-pass approach on the text decoder. We demonstrate that joint learning of these diverse objectives is simple and effective, and maximizes weight sharing across tasks. Furthermore, the same architecture enables straightforward extensions to open-vocabulary object detection and video-language tasks. The model tackles a diverse range of tasks while remaining modest in capacity. Our model achieves state-of-the-art results on image-text and text-image retrieval, video question answering, and open-vocabulary detection, outperforming much larger and more extensively trained foundation models. It shows very competitive results on VQA and video captioning, especially considering its capacity. Ablations confirm the flexibility and advantages of our approach.
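To make the two-pass idea concrete, below is a minimal PyTorch sketch, not the paper's implementation: the module names (`TwoPassDecoderLayer`, `TwoPassTextDecoder`), hyperparameters, masking, and pooling choices are illustrative assumptions. It only shows how a single set of decoder weights could serve a contrastive pass (bidirectional, text-only attention) and a generative pass (causal masking with cross-attention to vision features).

```python
import torch
import torch.nn as nn


class TwoPassDecoderLayer(nn.Module):
    # One decoder block whose cross-attention can be skipped on the fly.
    def __init__(self, dim, heads):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, image_feats=None, causal_mask=None):
        h, _ = self.self_attn(self.norm1(x), self.norm1(x), self.norm1(x), attn_mask=causal_mask)
        x = x + h
        if image_feats is not None:  # cross-attention used only in the generative pass
            h, _ = self.cross_attn(self.norm2(x), image_feats, image_feats)
            x = x + h
        return x + self.ffn(self.norm3(x))


class TwoPassTextDecoder(nn.Module):
    """Hypothetical sketch: the same decoder weights are reused for two passes,
    one producing a text embedding for contrastive learning and one producing
    next-token logits for generative learning."""

    def __init__(self, vocab=32000, dim=512, depth=4, heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.layers = nn.ModuleList(TwoPassDecoderLayer(dim, heads) for _ in range(depth))
        self.to_vocab = nn.Linear(dim, vocab)

    def contrastive_pass(self, text_ids):
        # Pass 1: bidirectional self-attention over text only (no image input).
        x = self.embed(text_ids)
        for layer in self.layers:
            x = layer(x)
        return x.mean(dim=1)  # pooled text embedding for a contrastive loss

    def generative_pass(self, text_ids, image_feats):
        # Pass 2: causal masking plus cross-attention to vision-encoder features.
        x = self.embed(text_ids)
        mask = nn.Transformer.generate_square_subsequent_mask(text_ids.size(1))
        for layer in self.layers:
            x = layer(x, image_feats=image_feats, causal_mask=mask)
        return self.to_vocab(x)  # next-token logits for captioning
```

In training, the contrastive pass would be paired with the vision encoder's image embedding under a CLIP-style contrastive loss, while the generative pass would be trained with standard next-token cross-entropy; both passes share the same decoder parameters.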