We present GLIPv2, a grounded VL understanding model that serves both localization tasks (e.g., object detection, instance segmentation) and Vision-Language (VL) understanding tasks (e.g., VQA, image captioning). GLIPv2 elegantly unifies localization pre-training and Vision-Language Pre-training (VLP) with three pre-training tasks: phrase grounding as a VL reformulation of the detection task, region-word contrastive learning as a novel region-word level contrastive learning task, and masked language modeling. This unification not only simplifies the previous multi-stage VLP procedure but also achieves mutual benefits between localization and understanding tasks. Experimental results show that a single GLIPv2 model (all model weights are shared) achieves near-SoTA performance on various localization and understanding tasks. The model also shows (1) strong zero-shot and few-shot adaptation performance on open-vocabulary object detection tasks and (2) superior grounding capability on VL understanding tasks. Code will be released at https://github.com/microsoft/GLIP.
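To make the unification concrete, the following is a minimal sketch of how the three pre-training objectives named in the abstract (phrase grounding, region-word contrastive learning, and masked language modeling) could be combined into a single training loss. This is an illustrative assumption, not the released GLIPv2 implementation; the function name `glipv2_pretraining_loss`, the tensor layouts, and the equal loss weighting are all hypothetical choices for exposition.

```python
# Minimal sketch (assumption, not the released GLIPv2 code): combining the
# three pre-training losses on top of fused region and word embeddings.
import torch
import torch.nn.functional as F

def glipv2_pretraining_loss(region_feats, word_feats, grounding_targets,
                            mlm_logits, mlm_labels, temperature=0.07):
    """
    region_feats:      (B, R, D) fused region embeddings per image
    word_feats:        (B, W, D) fused word embeddings per caption
    grounding_targets: (B, R, W) binary region-word alignment labels
    mlm_logits:        (B, W, V) predictions for masked tokens
    mlm_labels:        (B, W)    token ids, -100 for unmasked positions
    """
    # 1) Phrase grounding: intra-image region-word alignment scores trained
    #    against alignment labels (detection reformulated as grounding).
    align_logits = torch.einsum('brd,bwd->brw', region_feats, word_feats)
    loss_ground = F.binary_cross_entropy_with_logits(
        align_logits, grounding_targets.float())

    # 2) Region-word contrastive loss: every region is contrasted against
    #    the words of all captions in the batch, not just its own caption.
    B, R, D = region_feats.shape
    W = word_feats.shape[1]
    regions = F.normalize(region_feats.reshape(B * R, D), dim=-1)
    words = F.normalize(word_feats.reshape(B * W, D), dim=-1)
    sim = regions @ words.t() / temperature              # (B*R, B*W)
    # Positives are the aligned words of the same image, laid out as a
    # block-diagonal mask built from grounding_targets.
    pos_mask = torch.zeros_like(sim)
    for b in range(B):
        pos_mask[b * R:(b + 1) * R, b * W:(b + 1) * W] = grounding_targets[b]
    loss_contrastive = -(F.log_softmax(sim, dim=-1) * pos_mask).sum() \
                       / pos_mask.sum().clamp(min=1)

    # 3) Masked language modeling on the text side.
    loss_mlm = F.cross_entropy(mlm_logits.reshape(-1, mlm_logits.size(-1)),
                               mlm_labels.reshape(-1), ignore_index=-100)

    return loss_ground + loss_contrastive + loss_mlm
```

Because all three terms are computed from the same shared region and word embeddings, a single set of model weights can serve both localization and VL understanding, which is the unification the abstract describes.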