The development of language models has moved from encoder-decoder to decoder-only designs. In addition, conventional wisdom holds that the two most popular multimodal tasks, the generative and contrastive tasks, tend to conflict with one another, are hard to accommodate in a single architecture, and further require complex adaptations for downstream tasks. We propose a novel training paradigm with a decoder-only model for multimodal tasks, which is surprisingly effective at jointly learning these disparate vision-language tasks. This is done with a simple model, called MaMMUT, consisting of a single vision encoder and a text decoder; it accommodates both contrastive and generative learning through a novel two-pass approach on the text decoder. We demonstrate that joint training on these diverse-objective tasks is simple and effective, and it maximizes weight sharing across the model. Furthermore, the same architecture enables straightforward extensions to open-vocabulary object detection and video-language tasks. The model tackles a diverse range of tasks while remaining modest in capacity. It achieves state-of-the-art results on image-text and text-image retrieval, video question answering, and open-vocabulary detection, outperforming much larger and more extensively trained foundation models, and it shows competitive results on VQA and video captioning, especially considering its size. Ablations confirm the flexibility and advantages of our approach.
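To make the two-pass idea concrete, below is a minimal, hypothetical PyTorch-style sketch, not the authors' implementation. It assumes the contrastive pass runs the shared text decoder with bidirectional self-attention and no cross-attention, producing a pooled text embedding to contrast against the vision encoder's output, while the generative pass re-runs the same layers with causal masking and cross-attention to the image features for a captioning loss. All class names, dimensions, and pooling choices are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoPassDecoderLayer(nn.Module):
    """One shared decoder layer used by both passes (hypothetical sketch)."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, image_feats=None, causal=False):
        # Causal mask only in the generative pass; bidirectional otherwise.
        mask = None
        if causal:
            L = x.size(1)
            mask = torch.triu(torch.ones(L, L, device=x.device), diagonal=1).bool()
        h, _ = self.self_attn(self.n1(x), self.n1(x), self.n1(x), attn_mask=mask)
        x = x + h
        # Cross-attention to image features is skipped in the contrastive pass.
        if image_feats is not None:
            h, _ = self.cross_attn(self.n2(x), image_feats, image_feats)
            x = x + h
        return x + self.ffn(self.n3(x))

class MaMMUTSketch(nn.Module):
    """Shared text decoder reused for both objectives (illustrative only)."""
    def __init__(self, vocab=32000, dim=512, depth=6):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.layers = nn.ModuleList([TwoPassDecoderLayer(dim) for _ in range(depth)])
        self.to_logits = nn.Linear(dim, vocab)

    def contrastive_pass(self, tokens):
        # Pass 1: no cross-attention; mean-pooled text embedding for the
        # contrastive loss against the vision encoder's image embedding.
        x = self.embed(tokens)
        for layer in self.layers:
            x = layer(x, image_feats=None, causal=False)
        return F.normalize(x.mean(dim=1), dim=-1)

    def generative_pass(self, tokens, image_feats):
        # Pass 2: causal self-attention plus cross-attention to image
        # features, trained with next-token prediction (captioning loss).
        x = self.embed(tokens)
        for layer in self.layers:
            x = layer(x, image_feats=image_feats, causal=True)
        return self.to_logits(x)
```

Because both passes reuse the same layer weights, the contrastive and generative objectives share nearly the entire text decoder, which is the weight-sharing property the abstract emphasizes.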