X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks

Vision-language pre-training aims to learn alignments between vision and language from large amounts of data. We previously proposed multi-grained vision-language pre-training, a unified approach that learns vision-language alignments at multiple granularities. This paper advances that method by unifying image and video encoding in one model and scaling the model up with large-scale data. We present X$^2$-VLM, a pre-trained VLM with a modular architecture for both image-text and video-text tasks. Experimental results show that X$^2$-VLM performs best at both base and large scale on image-text and video-text tasks, striking a good trade-off between performance and model scale. Moreover, we show that the modular design of X$^2$-VLM makes it highly transferable, so it can be used in any language or domain. For example, by simply replacing the text encoder with XLM-R, X$^2$-VLM outperforms state-of-the-art multilingual multi-modal pre-trained models without any multilingual pre-training. The code and pre-trained models will be available at github.com/zengyan-97/X2-VLM.
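The idea of swapping the text encoder in a modular model can be illustrated with a toy sketch. This is not the authors' implementation: `ModularVLM`, the encoder functions, and `dot_fusion` are hypothetical names invented for illustration, and the "encoders" are trivial stand-ins, not real networks. The sketch only shows that when the text encoder is a separate module, it can be replaced while the vision encoder and fusion module are reused unchanged.

```python
from dataclasses import dataclass, replace
from typing import Callable, List

Vector = List[float]


@dataclass(frozen=True)
class ModularVLM:
    """Toy stand-in for a modular vision-language model (hypothetical)."""
    vision_encoder: Callable[[List[float]], Vector]
    text_encoder: Callable[[str], Vector]
    fusion: Callable[[Vector, Vector], float]

    def score(self, pixels: List[float], text: str) -> float:
        # Encode each modality separately, then fuse into one matching score.
        return self.fusion(self.vision_encoder(pixels), self.text_encoder(text))


def toy_vision_encoder(pixels: List[float]) -> Vector:
    # Toy 2-dim "feature": mean and max intensity.
    return [sum(pixels) / len(pixels), max(pixels)]


def english_text_encoder(text: str) -> Vector:
    # Toy monolingual encoder: relies on whitespace-separated words.
    return [float(len(text.split())), float(len(text))]


def multilingual_text_encoder(text: str) -> Vector:
    # Toy stand-in for a multilingual encoder such as XLM-R (assumption:
    # a real one would be a pre-trained transformer). Works on characters,
    # so it does not depend on whitespace tokenization.
    return [float(len(text)), float(len(set(text)))]


def dot_fusion(v: Vector, t: Vector) -> float:
    # Toy fusion: dot product of the two feature vectors.
    return sum(a * b for a, b in zip(v, t))


model = ModularVLM(toy_vision_encoder, english_text_encoder, dot_fusion)

# Replace only the text encoder; vision encoder and fusion are reused as-is.
multilingual_model = replace(model, text_encoder=multilingual_text_encoder)
```

In this sketch, `multilingual_model.score(pixels, text)` accepts text in any script, while the vision and fusion components are the same objects as in `model`.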

