Foundation models, or pre-trained models, have substantially improved the performance of various language, vision, and vision-language understanding tasks. However, existing foundation models perform best on only one type of task, namely language, vision, or vision-language. It remains an open question whether it is possible to construct a foundation model that performs best across all understanding tasks, which we call a general foundation model. In this paper, we propose a new general foundation model, X-FM (the X-Foundation Model). X-FM has one language encoder, one vision encoder, and one fusion encoder, as well as a new training method. The training method includes two new techniques for learning X-FM from text, image, and image-text pair data. One is to stop gradients from the vision-language training when learning the language encoder. The other is to leverage the vision-language training to guide the learning of the vision encoder. Extensive experiments on benchmark datasets show that X-FM significantly outperforms existing general foundation models and performs better than, or comparably to, existing foundation models built specifically for language, vision, or vision-language understanding.
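The stop-gradient technique can be illustrated with a minimal toy sketch. This is a hypothetical scalar example, not the paper's actual architecture or training code: the point is only that when the language feature is detached before entering the fusion (vision-language) loss, that loss contributes nothing to the gradient of the language encoder's weights, which are then updated by the language-only loss alone.

```python
# Toy illustration (hypothetical, pure Python) of stop-gradient: the fusion
# (vision-language) loss is computed on a "detached" language feature, so it
# does not back-propagate into the language encoder's weight w_lang.

def language_encoder(w_lang, text):
    # Toy "encoder": a single scalar weight times a scalar input.
    return w_lang * text

def grad_wrt_w_lang(w_lang, text, h_vis, target, stop_grad):
    """Analytic gradient of (language_loss + fusion_loss) w.r.t. w_lang,
    where both toy losses are squared errors."""
    h = language_encoder(w_lang, text)
    g = 2.0 * (h - target) * text        # gradient from the language-only loss
    if not stop_grad:
        # Without stop-gradient, the fusion loss (h - h_vis)^2 also
        # back-propagates through h into w_lang.
        g += 2.0 * (h - h_vis) * text
    return g

g_stop = grad_wrt_w_lang(0.5, text=2.0, h_vis=3.0, target=1.0, stop_grad=True)
g_full = grad_wrt_w_lang(0.5, text=2.0, h_vis=3.0, target=1.0, stop_grad=False)
print(g_stop, g_full)  # with stop-gradient the fusion term is absent
```

In an autodiff framework the same effect is typically achieved by detaching the language encoder's output before it is fed to the fusion encoder (e.g. `h.detach()` in PyTorch), so the vision-language objectives shape only the fusion encoder while the language encoder keeps learning from text data.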