Existing approaches to vision-language pre-training (VLP) rely heavily on an object detector based on bounding boxes (regions): salient objects are first detected from images, and a Transformer-based model is then used for cross-modal fusion. Despite their superior performance, these approaches are bounded by the capability of the object detector in terms of both effectiveness and efficiency. Moreover, the presence of object detection imposes unnecessary constraints on model design and makes end-to-end training difficult to support. In this paper, we revisit grid-based convolutional features for vision-language pre-training, skipping the expensive region-related steps. We propose a simple yet effective grid-based VLP method that works surprisingly well with grid features. Pre-trained only on in-domain datasets, the proposed Grid-VLP method outperforms most competitive region-based VLP methods on three examined vision-language understanding tasks. We hope that our findings help to further advance the state of the art in vision-language pre-training and provide a new direction towards effective and efficient VLP.
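To make the core idea concrete, the following is a minimal sketch of how grid features can replace region features as visual inputs to a Transformer: a convolutional feature map is flattened into a sequence of grid tokens and concatenated with text embeddings for cross-modal fusion, with no detection step. All shapes, the projection matrix, and the random values are illustrative assumptions, not details from the paper.

```python
import numpy as np

# Hypothetical CNN feature map for one image: (channels, height, width).
# In a grid-based VLP setup these would come from a convolutional
# backbone; random values are used here purely for illustration.
C, H, W = 2048, 7, 7
feature_map = np.random.rand(C, H, W)

# Flatten the spatial grid into a sequence of H*W "visual tokens",
# each a C-dimensional vector, skipping any region/detection step.
grid_tokens = feature_map.reshape(C, H * W).T  # shape: (49, 2048)

# Project to an assumed Transformer hidden size (768) with a
# hypothetical linear projection, then concatenate with text
# embeddings to form the cross-modal fusion input.
hidden = 768
W_proj = np.random.rand(C, hidden) * 0.01
visual_embeds = grid_tokens @ W_proj           # shape: (49, 768)

text_embeds = np.random.rand(16, hidden)       # 16 text tokens (assumed)
fused_input = np.concatenate([text_embeds, visual_embeds], axis=0)
print(fused_input.shape)                       # (65, 768)
```

Because the grid is a fixed, dense tiling of the image, this pipeline avoids the per-image cost and design constraints of running an object detector, which is the efficiency argument the abstract makes.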