Recent years have witnessed a significant convergence of language, vision, and multi-modal pretraining. In this work, we present mPLUG-2, a new unified paradigm with a modularized design for multi-modal pretraining, which can benefit from modality collaboration while addressing the problem of modality entanglement. In contrast to predominant paradigms that rely solely on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network that shares common universal modules for modality collaboration and disentangles modality-specific modules to deal with modality entanglement. Different modules can be flexibly selected for different understanding and generation tasks across all modalities, including text, image, and video. Empirical study shows that mPLUG-2 achieves state-of-the-art or competitive results on a broad range of over 30 downstream tasks, spanning multi-modal tasks of image-text and video-text understanding and generation, and uni-modal tasks of text-only, image-only, and video-only understanding. Notably, mPLUG-2 achieves new state-of-the-art results of 48.0 top-1 accuracy and 80.3 CIDEr on the challenging MSRVTT video QA and video captioning tasks with a far smaller model size and data scale. It also demonstrates strong zero-shot transferability on vision-language and video-language tasks. Code and models will be released at https://github.com/alibaba/AliceMind.
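To illustrate the modularized idea described above (shared universal modules for collaboration, disentangled modality-specific modules, and flexible module selection per task), here is a minimal PyTorch-style sketch. It is an assumption-laden toy composition, not the actual mPLUG-2 architecture: all class names, dimensions, and the composition logic are hypothetical placeholders.

```python
import torch
import torch.nn as nn


class TextModule(nn.Module):
    """Hypothetical modality-specific text encoder (placeholder, not mPLUG-2's)."""
    def __init__(self, dim=256, vocab=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)

    def forward(self, token_ids):
        return self.encoder(self.embed(token_ids))


class VisionModule(nn.Module):
    """Hypothetical modality-specific encoder over image/video patch features."""
    def __init__(self, in_dim=512, dim=256):
        super().__init__()
        self.proj = nn.Linear(in_dim, dim)
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)

    def forward(self, patch_feats):
        return self.encoder(self.proj(patch_feats))


class UniversalModule(nn.Module):
    """Shared module applied to every modality, standing in for 'modality collaboration'."""
    def __init__(self, dim=256):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)

    def forward(self, x):
        return self.layer(x)


class ModularModel(nn.Module):
    """Compose disentangled modality modules with one shared universal module."""
    def __init__(self, dim=256):
        super().__init__()
        self.text = TextModule(dim=dim)        # modality-specific
        self.vision = VisionModule(dim=dim)    # modality-specific
        self.universal = UniversalModule(dim=dim)  # shared across modalities

    def forward(self, token_ids=None, patch_feats=None):
        # Only the modules needed for the given inputs/task are selected.
        outputs = {}
        if token_ids is not None:
            outputs["text"] = self.universal(self.text(token_ids))
        if patch_feats is not None:
            outputs["vision"] = self.universal(self.vision(patch_feats))
        return outputs


model = ModularModel()
out = model(token_ids=torch.randint(0, 1000, (2, 16)),
            patch_feats=torch.randn(2, 49, 512))
print({k: v.shape for k, v in out.items()})
```

The design choice the sketch mirrors is that a downstream task can route its inputs through only the modules it needs: a text-only task skips the vision module entirely, while a video-text task runs both modality-specific modules and still shares the same universal module, which is where cross-modal collaboration would occur.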