Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification in both zero-shot and transfer learning settings. However, we show that directly applying such models to recognize image regions for object detection leads to poor performance due to a domain shift: CLIP was trained to match an image as a whole to a text description, without capturing the fine-grained alignment between image regions and text spans. To mitigate this issue, we propose a new method called RegionCLIP that significantly extends CLIP to learn region-level visual representations, thus enabling fine-grained alignment between image regions and textual concepts. Our method leverages a CLIP model to match image regions with template captions and then pretrains our model to align these region-text pairs in the feature space. When transferring our pretrained model to the open-vocabulary object detection task, our method significantly outperforms the state of the art by 3.8 AP50 and 2.2 AP for novel categories on COCO and LVIS datasets, respectively. Moreover, the learned region representations support zero-shot inference for object detection, showing promising results on both COCO and LVIS datasets. Our code is available at https://github.com/microsoft/RegionCLIP.
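To make the two-step idea in the abstract concrete, below is a minimal, illustrative sketch (not the authors' implementation) of region-text alignment: a teacher CLIP model pseudo-labels each image region with its closest template caption, and a contrastive loss then aligns region features with the corresponding caption embeddings. The names `pseudo_label_regions`, `region_text_contrastive_loss`, and the stand-in random features are assumptions for illustration only; in practice the region features would come from a detector backbone with RoIAlign and the concept features from CLIP's text encoder applied to prompts such as "a photo of a {concept}".

```python
# Illustrative sketch of RegionCLIP-style region-text alignment pretraining.
# Assumptions: region_feats are RoIAlign-pooled features from a visual backbone;
# concept_feats are CLIP text embeddings of template captions.
import torch
import torch.nn.functional as F

def pseudo_label_regions(region_feats, concept_feats):
    """Match each image region to its closest template caption (teacher CLIP step)."""
    r = F.normalize(region_feats, dim=-1)   # (N, D) region embeddings
    t = F.normalize(concept_feats, dim=-1)  # (C, D) caption embeddings
    sim = r @ t.t()                         # (N, C) cosine similarities
    return sim.argmax(dim=-1)               # best-matching caption index per region

def region_text_contrastive_loss(region_feats, concept_feats, labels, tau=0.07):
    """Align each region with its pseudo-labeled caption in the shared feature space."""
    r = F.normalize(region_feats, dim=-1)
    t = F.normalize(concept_feats, dim=-1)
    logits = (r @ t.t()) / tau              # temperature-scaled similarities
    return F.cross_entropy(logits, labels)

# Toy usage with random stand-in features (D=512, matching CLIP ViT-B/32).
regions = torch.randn(8, 512)     # placeholder for pooled region features
concepts = torch.randn(100, 512)  # placeholder for 100 concept/caption embeddings
labels = pseudo_label_regions(regions, concepts)
loss = region_text_contrastive_loss(regions, concepts, labels)
print(loss.item())
```

The sketch only captures the alignment objective; the full method also transfers the pretrained region encoder to an open-vocabulary detector, which is beyond this snippet.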