Vision-language pre-training (VLP) has shown impressive performance on a wide range of cross-modal tasks, where VLP models that do not rely on object detectors are becoming the mainstream due to their superior computation efficiency and competitive performance. However, removing the object detector also deprives VLP models of the capability for explicit object modeling, which is essential to various position-sensitive vision-language (VL) tasks, such as referring expression comprehension and visual commonsense reasoning. To address this challenge, we introduce PEVL, which enhances the pre-training and prompt tuning of VLP models with explicit object position modeling. Specifically, PEVL reformulates discretized object positions and language in a unified language modeling framework, which facilitates explicit VL alignment during pre-training and also enables flexible prompt tuning for various downstream tasks. We show that PEVL achieves state-of-the-art performance among detector-free VLP models on position-sensitive tasks such as referring expression comprehension and phrase grounding, and also improves performance on position-insensitive tasks with grounded inputs. We make the data and code for this paper publicly available at https://github.com/thunlp/PEVL.
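To make the central idea concrete, the sketch below illustrates, in simplified form, how continuous bounding-box coordinates can be discretized into position tokens and interleaved with text so that positions and language share a single token sequence for language modeling. This is a minimal illustration, not the authors' implementation; the bin count, normalization scheme, token format (`<pos_k>`), helper names, and the example caption are all assumptions for exposition.

```python
# Minimal sketch (assumed, not the authors' code) of discretizing box
# coordinates into position tokens for a unified language modeling input.

NUM_BINS = 512      # assumed number of discretization bins
IMAGE_SIZE = 1024   # assumed image side length used for normalization


def box_to_position_tokens(box, image_size=IMAGE_SIZE, num_bins=NUM_BINS):
    """Map a pixel box (x1, y1, x2, y2) to discrete position tokens."""
    tokens = []
    for coord in box:
        bin_id = min(int(coord / image_size * num_bins), num_bins - 1)
        tokens.append(f"<pos_{bin_id}>")
    return tokens


def ground_phrase(phrase, box):
    """Append the discretized position of `phrase`, so text and object
    positions form one token sequence that a language model can handle."""
    return f"{phrase} {' '.join(box_to_position_tokens(box))}"


# Hypothetical example: grounding the phrase "a dog" at a pixel box.
caption = f"{ground_phrase('a dog', (120, 300, 480, 760))} chasing a ball"
print(caption)
# -> a dog <pos_60> <pos_150> <pos_240> <pos_380> chasing a ball
```

Because positions become ordinary tokens, the same masked language modeling objective and prompt formats used for text can, in principle, be reused for position prediction during pre-training and prompt tuning.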