Most existing vision-language pre-training methods rely on object-centric features extracted by an object detector and perform fine-grained alignment between those features and the text. We argue that object detection may not be suitable for vision-language pre-training. Instead, the task should be to locate in the image the regions of the 'visual concepts' mentioned in the text and, at the same time, to align the text with those visual concepts at multiple granularities. This paper proposes a new method, X-VLM, that performs 'multi-grained vision language pre-training'. Experimental results show that X-VLM consistently outperforms state-of-the-art methods on many downstream vision-language tasks.