mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video


February 1, 2023



Haiyang Xu, Qinghao Ye, Ming Yan, Yaya Shi, Jiabo Ye, Yuanhong Xu, Chenliang Li, Bin Bi, Qi Qian, Wei Wang, Guohai Xu, Ji Zhang, Songfang Huang, Fei Huang, Jingren Zhou

Recent years have witnessed a big convergence of language, vision, and multi-modal pretraining. In this work, we present mPLUG-2, a new unified paradigm with modularized design for multi-modal pretraining, which can benefit from modality collaboration while addressing the problem of modality entanglement. In contrast to predominant paradigms that rely solely on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network that shares common universal modules for modality collaboration and disentangles modality-specific modules to deal with modality entanglement. Different modules can be flexibly selected for different understanding and generation tasks across all modalities, including text, image, and video. Empirical study shows that mPLUG-2 achieves state-of-the-art or competitive results on a broad range of over 30 downstream tasks, spanning multi-modal tasks of image-text and video-text understanding and generation, and uni-modal tasks of text-only, image-only, and video-only understanding. Notably, mPLUG-2 shows new state-of-the-art results of 48.0 top-1 accuracy and 80.3 CIDEr on the challenging MSRVTT video QA and video caption tasks with a far smaller model size and data scale. It also demonstrates strong zero-shot transferability on vision-language and video-language tasks. Code and models will be released at https://github.com/alibaba/AliceMind.

