Vision language pre-training aims to learn alignments between vision and language from a large amount of data. We propose multi-grained vision language pre-training, a unified approach which learns vision language alignments at multiple granularities. This paper advances the proposed method by unifying image and video encoding in one model and scaling up the model with large-scale data. We present X$^2$-VLM, a pre-trained VLM with a modular architecture for both image-text and video-text tasks. Experimental results show that X$^2$-VLM performs the best at both base and large scale on image-text and video-text tasks, making a good trade-off between performance and model scale. Moreover, we show that the modular design of X$^2$-VLM results in high transferability, allowing it to be utilized in any language or domain. For example, by simply replacing the text encoder with XLM-R, X$^2$-VLM outperforms state-of-the-art multilingual multi-modal pre-trained models without any multilingual pre-training. The code and pre-trained models will be available at github.com/zengyan-97/X2-VLM.