This paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representations semantic-rich. In our experiments, we pre-train GLIP on 27M grounding data, including 3M human-annotated and 24M web-crawled image-text pairs. The learned representations demonstrate strong zero-shot and few-shot transferability to various object-level recognition tasks. 1) When directly evaluated on COCO and LVIS (without seeing any images in COCO during pre-training), GLIP achieves 49.8 AP and 26.9 AP, respectively, surpassing many supervised baselines. 2) After fine-tuning on COCO, GLIP achieves 60.8 AP on val and 61.5 AP on test-dev, surpassing prior SoTA. 3) When transferred to 13 downstream object detection tasks, a 1-shot GLIP rivals a fully-supervised Dynamic Head. Code will be released at https://github.com/microsoft/GLIP.
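To make the unified detection/grounding formulation concrete, the sketch below (a minimal illustration, not the paper's released code) shows the core idea of word-region alignment: classification logits over fixed categories are replaced by dot-product similarities between region features and token features of a text prompt, so the same head serves detection (prompt = concatenated class names) and phrase grounding (prompt = caption). All tensor shapes and names here are illustrative assumptions.

```python
# Hedged sketch of word-region alignment scoring, assuming 256-d visual
# region features and 256-d language token features from some encoders.
import torch


def alignment_scores(region_feats: torch.Tensor,   # (num_regions, d) visual features
                     token_feats: torch.Tensor      # (num_tokens, d) language features
                     ) -> torch.Tensor:
    """Return (num_regions, num_tokens) word-region alignment logits."""
    return region_feats @ token_feats.t()


# Toy usage: 4 candidate regions, a prompt tokenized into 6 tokens.
regions = torch.randn(4, 256)
tokens = torch.randn(6, 256)
scores = alignment_scores(regions, tokens)  # trained against detection or grounding labels
print(scores.shape)  # torch.Size([4, 6])
```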