Pre-trained vision-language models (VLMs) learn to align vision and language representations on large-scale datasets, where each image-text pair usually contains a bag of semantic concepts. However, existing open-vocabulary object detectors only align region embeddings individually with the corresponding features extracted from the VLMs. Such a design leaves the compositional structure of semantic concepts in a scene under-exploited, although the structure may be implicitly learned by the VLMs. In this work, we propose to align the embedding of bag of regions beyond individual regions. The proposed method groups contextually interrelated regions as a bag. The embeddings of regions in a bag are treated as embeddings of words in a sentence, and they are sent to the text encoder of a VLM to obtain the bag-of-regions embedding, which is learned to be aligned to the corresponding features extracted by a frozen VLM. Applied to the commonly used Faster R-CNN, our approach surpasses the previous best results by 4.6 box AP50 and 2.8 mask AP on novel categories of open-vocabulary COCO and LVIS benchmarks, respectively. Code and models are available at https://github.com/wusize/ovdet.
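To make the described alignment concrete, the following is a minimal sketch, assuming a CLIP-style setup: per-region embeddings are treated as token embeddings of a pseudo-sentence, pushed through the VLM text encoder to form one bag-of-regions embedding, and aligned with the feature a frozen VLM extracts for the image crop enclosing the bag. All names (`bag_of_regions_embedding`, `alignment_loss`, the stand-in `text_encoder`, `teacher_emb`) are illustrative placeholders, not the authors' released code.

```python
# Hedged sketch of bag-of-regions alignment; encoders are placeholders.
import torch
import torch.nn.functional as F


def bag_of_regions_embedding(region_embs, text_encoder):
    """Treat region embeddings like word embeddings of a sentence and pass
    them through the VLM text encoder to obtain a single bag embedding.

    region_embs: (num_regions_in_bag, dim), assumed already projected into
    the text encoder's token-embedding space (projection omitted here).
    """
    tokens = region_embs.unsqueeze(0)        # (1, seq_len, dim)
    bag_emb = text_encoder(tokens)           # (1, dim) pooled output
    return F.normalize(bag_emb, dim=-1)


def alignment_loss(bag_embs, frozen_vlm_embs, temperature=0.07):
    """Contrastive-style alignment between student bag-of-regions embeddings
    and the frozen VLM's embeddings of the corresponding image crops."""
    frozen_vlm_embs = F.normalize(frozen_vlm_embs, dim=-1)
    logits = bag_embs @ frozen_vlm_embs.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    dim = 512
    # Placeholder for the VLM text encoder (e.g., CLIP's); here a toy module.
    text_encoder = torch.nn.Sequential(
        torch.nn.Flatten(start_dim=1),
        torch.nn.LazyLinear(dim),
    )
    region_embs = torch.randn(3, dim)        # 3 contextually interrelated regions in one bag
    bag_emb = bag_of_regions_embedding(region_embs, text_encoder)
    teacher_emb = torch.randn(1, dim)        # frozen VLM feature of the bag's enclosing crop
    print(alignment_loss(bag_emb, teacher_emb).item())
```

In practice the teacher embedding would come from a frozen VLM image encoder applied to the crop that encloses the bag, and multiple bags per batch would serve as mutual negatives in the contrastive loss; the toy single-bag batch above is only to keep the sketch self-contained.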