Most existing methods in vision language pre-training rely on object-centric features extracted through object detection and make fine-grained alignments between the extracted features and texts. It is challenging for these methods to learn relations among multiple objects. To this end, we propose a new method called X-VLM to perform "multi-grained vision language pre-training." The key to learning multi-grained alignments is to locate visual concepts in the image given the associated texts and, at the same time, align the texts with the visual concepts, where the alignments are made at multiple granularities. Experimental results show that X-VLM effectively applies the learned multi-grained alignments to many downstream vision language tasks and consistently outperforms state-of-the-art methods.