Existing open-vocabulary object detectors typically enlarge their vocabulary sizes by leveraging different forms of weak supervision. This helps generalize to novel objects at inference. Two popular forms of weak-supervision used in open-vocabulary detection (OVD) include pretrained CLIP model and image-level supervision. We note that both these modes of supervision are not optimally aligned for the detection task: CLIP is trained with image-text pairs and lacks precise localization of objects while the image-level supervision has been used with heuristics that do not accurately specify local object regions. In this work, we propose to address this problem by performing object-centric alignment of the language embeddings from the CLIP model. Furthermore, we visually ground the objects with only image-level supervision using a pseudo-labeling process that provides high-quality object proposals and helps expand the vocabulary during training. We establish a bridge between the above two object-alignment strategies via a novel weight transfer function that aggregates their complimentary strengths. In essence, the proposed model seeks to minimize the gap between object and image-centric representations in the OVD setting. On the COCO benchmark, our proposed approach achieves 36.6 AP50 on novel classes, an absolute 8.2 gain over the previous best performance. For LVIS, we surpass the state-of-the-art ViLD model by 5.0 mask AP for rare categories and 3.4 overall. Code: https://github.com/hanoonaR/object-centric-ovd.
翻译:现有的开放词汇对象探测器通常通过利用不同形式的薄弱监督来扩大其词汇规模。 这有助于对推断中的新对象进行概括化。 两种在开放词汇检测中使用的常用的薄弱监督模式包括预先培训的 CLIP 模型和图像级监督。 我们注意到,这两种监督模式都与探测任务没有最佳一致: CLIP 使用图像- 文本配对培训,并且缺乏对对象的精确本地化,而图像级监督则使用不准确指定当地目标区域的超常性能监管。 在这项工作中,我们建议通过对CLIP 模型中嵌入的语言进行以对象为中心的目标中心调整来解决这一问题。 此外,我们用一个提供高质量对象建议书的假标签程序将目标置于仅图像级监督之下,并有助于在培训期间扩大词汇。 我们通过一种新式重力转换功能,将上述两个目标的定位战略连接起来。 实质上, 拟议的模型试图将对象和ViR- 目标区域之间的图像中心显示差距缩小到最小的距离上。 在OV- diveD 类中, 我们提出的绝对的ARV- develop Streal- developal- develop sal- develop the State State sal as.