Open-vocabulary detection (OVD) is an object detection task that aims to detect objects from novel categories beyond the base categories the detector is trained on. Recent OVD methods rely on large-scale vision-language pre-trained models, such as CLIP, to recognize novel objects. We identify two core obstacles that must be tackled when incorporating these models into detector training: (1) the distribution mismatch that arises when a vision-language model trained on whole images is applied to region recognition; and (2) the difficulty of localizing objects of unseen classes. To overcome these obstacles, we propose CORA, a DETR-style framework that adapts CLIP for Open-vocabulary detection by Region prompting and Anchor pre-matching. Region prompting mitigates the whole-to-region distribution gap by prompting the region features of the CLIP-based region classifier. Anchor pre-matching helps learn generalizable object localization through a class-aware matching mechanism. We evaluate CORA on the COCO OVD benchmark, where it achieves 41.7 AP50 on novel classes, outperforming the previous SOTA by 2.4 AP50 without resorting to extra training data. When extra training data is available, we train CORA$^+$ on both ground-truth base-category annotations and additional pseudo bounding-box labels computed by CORA. CORA$^+$ achieves 43.1 AP50 on the COCO OVD benchmark and 28.1 box APr on the LVIS OVD benchmark.
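Below is a minimal sketch of the two mechanisms named above, assuming a frozen CLIP-like image encoder whose feature map is pooled into region features with torchvision's `roi_align`. All names here (`RegionPromptedClassifier`, `anchor_pre_match`, the linear projection head, etc.) are illustrative placeholders, not CORA's actual implementation; in particular, the paper inserts the learnable prompts before CLIP's attention pooling head, which this sketch approximates with a simple projection.

```python
# Hypothetical sketch of region prompting and anchor pre-matching.
# Not CORA's released code; shapes and modules are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import roi_align


class RegionPromptedClassifier(nn.Module):
    def __init__(self, feat_dim: int, embed_dim: int, pool_size: int = 7):
        super().__init__()
        self.pool_size = pool_size
        # Learnable prompt added to every RoI-pooled region feature map;
        # this is what adapts whole-image CLIP features to region inputs.
        self.region_prompt = nn.Parameter(
            torch.zeros(1, feat_dim, pool_size, pool_size))
        # Stand-in for CLIP's attention pooling head: maps prompted
        # region features into the joint vision-language embedding space.
        self.proj = nn.Linear(feat_dim * pool_size * pool_size, embed_dim)

    def forward(self, feat_map, boxes, text_embeds, scale, tau=0.01):
        # feat_map:    (B, C, H, W) frozen CLIP image-encoder features
        # boxes:       list of (N_i, 4) anchor/proposal boxes in xyxy
        # text_embeds: (K, D) class-name embeddings from CLIP's text encoder
        regions = roi_align(feat_map, boxes, self.pool_size,
                            spatial_scale=scale, aligned=True)
        regions = regions + self.region_prompt          # region prompting
        v = F.normalize(self.proj(regions.flatten(1)), dim=-1)
        t = F.normalize(text_embeds, dim=-1)
        return v @ t.T / tau                            # (N, K) class logits


def anchor_pre_match(anchor_logits, gt_labels):
    # Class-aware anchor pre-matching: each anchor is assigned the class
    # it scores highest on, and only anchors whose pre-matched class
    # appears among the image's ground-truth labels are kept for decoding,
    # so box regression is conditioned on class rather than tied to it.
    pre_labels = anchor_logits.argmax(dim=-1)           # (N,)
    keep = torch.isin(pre_labels, gt_labels)            # (N,) bool mask
    return pre_labels, keep
```

The key design point the sketch tries to convey is that the prompt is a single learned tensor shared across all regions: it corrects the whole-image-to-region distribution shift without fine-tuning (and thus without degrading) the CLIP encoder itself.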