Most existing methods in vision language pre-training rely on object-centric features extracted through object detection and make fine-grained alignments between the extracted features and texts. We argue that object detection may not be necessary for vision language pre-training. To this end, we propose a new method called X-VLM to perform "multi-grained vision language pre-training." The key to learning multi-grained alignments is to locate visual concepts in the image given the associated texts and, at the same time, to align the texts with the visual concepts, where the alignments are at multiple granularities. Experimental results show that X-VLM effectively applies the learned alignments to many downstream vision language tasks and consistently outperforms state-of-the-art methods.
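The abstract names two coupled objectives, locating a visual concept given its text and aligning the text with that concept, but does not state the losses used. The following is only an illustrative sketch under assumptions: it scores localization with box IoU and alignment with a symmetric contrastive (InfoNCE-style) loss over cosine similarities. The function names `iou`, `cosine`, and `contrastive_loss`, the `temperature` value, and the toy embeddings are all hypothetical and are not taken from X-VLM.

```python
import math

def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(text_embs, visual_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss; the i-th text is assumed to match
    the i-th visual concept (a region, an object, or the whole image)."""
    n = len(text_embs)
    sims = [[cosine(t, v) / temperature for v in visual_embs]
            for t in text_embs]

    def cross_entropy(row, target):
        # Numerically stable log-sum-exp minus the target logit.
        m = max(row)
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        return log_z - row[target]

    text_to_visual = sum(cross_entropy(sims[i], i) for i in range(n)) / n
    visual_to_text = sum(
        cross_entropy([sims[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return 0.5 * (text_to_visual + visual_to_text)

# Toy data: two texts, each paired with one visual concept embedding.
texts = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = contrastive_loss(texts, aligned)
loss_shuffled = contrastive_loss(texts, shuffled)
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
```

In this toy setup the correctly paired embeddings yield a much lower loss than the shuffled ones, and because the same loss can be applied to embeddings of regions, objects, or whole images, it gives one possible reading of alignment at multiple granularities.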