In this paper, we present an open-set object detector, called Grounding DINO, by marrying the Transformer-based detector DINO with grounded pre-training, which can detect arbitrary objects given human inputs such as category names or referring expressions. The key to open-set object detection is introducing language into a closed-set detector for open-set concept generalization. To effectively fuse language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, a language-guided query selection, and a cross-modality decoder. While previous works mainly evaluate open-set object detection on novel categories, we propose to also perform evaluations on referring expression comprehension for objects specified with attributes. Grounding DINO performs remarkably well on all three settings, including benchmarks on COCO, LVIS, ODinW, and RefCOCO/+/g. Grounding DINO achieves a $52.5$ AP on the COCO detection zero-shot transfer benchmark, i.e., without any training data from COCO. It sets a new record on the ODinW zero-shot benchmark with a mean $26.1$ AP. Code will be available at \url{https://github.com/IDEA-Research/GroundingDINO}.
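To make the language-guided query selection step concrete, the sketch below shows one plausible reading of it in PyTorch: each image token is scored by its maximum similarity to any text token, and the top-scoring tokens seed the decoder queries. The function name, tensor shapes, and the default of 900 queries are assumptions for illustration, not the authors' exact implementation.

```python
import torch


def language_guided_query_selection(image_features: torch.Tensor,
                                    text_features: torch.Tensor,
                                    num_queries: int = 900) -> torch.Tensor:
    """Select the image tokens most relevant to the text as decoder queries.

    image_features: (num_img_tokens, d) enhanced image features
    text_features:  (num_txt_tokens, d) enhanced text features
    Returns the indices of the top-`num_queries` image tokens.
    Shapes and names here are illustrative assumptions.
    """
    # Similarity of every image token to every text token.
    logits = image_features @ text_features.T            # (num_img, num_txt)
    # Score each image token by its best-matching text token.
    scores = logits.max(dim=-1).values                   # (num_img,)
    # The highest-scoring image tokens initialize the decoder queries.
    return torch.topk(scores, k=num_queries).indices     # (num_queries,)
```

The selected indices would then gather positional (and possibly content) features to initialize the cross-modality decoder's queries, so that decoding starts from regions already aligned with the input text.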