Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection

Existing open-vocabulary object detectors typically enlarge their vocabulary sizes by leveraging different forms of weak supervision, which helps them generalize to novel objects at inference. Two popular forms of weak supervision used in open-vocabulary detection (OVD) are the pretrained CLIP model and image-level supervision. We note that neither mode of supervision is optimally aligned with the detection task: CLIP is trained on image-text pairs and lacks precise localization of objects, while image-level supervision has been used with heuristics that do not accurately specify local object regions. In this work, we propose to address this problem by performing object-centric alignment of the language embeddings from the CLIP model. Furthermore, we visually ground the objects with only image-level supervision using a pseudo-labeling process that provides high-quality object proposals and helps expand the vocabulary during training. We establish a bridge between these two object-alignment strategies via a novel weight transfer function that aggregates their complementary strengths. In essence, the proposed model seeks to minimize the gap between object-centric and image-centric representations in the OVD setting. On the COCO benchmark, our proposed approach achieves 36.6 AP50 on novel classes, an absolute gain of 8.2 over the previous best performance. For LVIS, we surpass the state-of-the-art ViLD model by 5.0 mask AP for rare categories and 3.4 overall. Code: https://github.com/hanoonaR/object-centric-ovd.
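For readers new to OVD, the general idea of scoring a detected region against language embeddings can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the temperature value, and the three-dimensional toy vectors are all assumptions made for illustration, and real CLIP embeddings are far higher-dimensional. The sketch only shows that a region is assigned to whichever class-name embedding it is most similar to, so adding a class means adding a text embedding, not retraining a classifier head.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def classify_region(region_embedding, text_embeddings, temperature=0.01):
    """Score one region embedding against every class-name embedding.

    Hypothetical helper: returns a softmax distribution over class names
    based on temperature-scaled cosine similarity.
    """
    region = l2_normalize(region_embedding)
    names = list(text_embeddings)
    logits = []
    for name in names:
        text = l2_normalize(text_embeddings[name])
        cosine = sum(r * t for r, t in zip(region, text))
        logits.append(cosine / temperature)
    # Subtract the maximum logit before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(logit - max_logit) for logit in logits]
    total = sum(exps)
    return {name: e / total for name, e in zip(names, exps)}


# Toy embeddings (made up); "umbrella" plays the role of a novel class
# that is introduced only through its text embedding.
text_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.1, 0.9, 0.0],
    "umbrella": [0.0, 0.1, 0.9],
}
region_embedding = [0.05, 0.2, 0.8]

scores = classify_region(region_embedding, text_embeddings)
best = max(scores, key=scores.get)
print(best)
```

The abstract's point is that this kind of matching works well only when region embeddings and language embeddings are aligned at the object level, which is the gap the proposed method targets.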

