Benefiting from large-scale vision-language pre-training on image-text pairs, open-world detection methods have shown superior generalization ability under the zero-shot or few-shot detection settings. However, existing methods still require a pre-defined category space during inference and only predict objects belonging to that space. To introduce a "real" open-world detector, in this paper we propose a novel method named CapDet that can either predict under a given category list or directly generate the category of each predicted bounding box. Specifically, we unify open-world detection and dense captioning into a single yet effective framework by introducing an additional dense captioning head that generates region-grounded captions. Moreover, adding the captioning task in turn benefits the generalization of the detection performance, since the captioning dataset covers more concepts. Experimental results show that by unifying the dense captioning task, our CapDet obtains significant performance improvements (e.g., +2.1% mAP on LVIS rare classes) over the baseline method on LVIS (1203 classes). Besides, our CapDet also achieves state-of-the-art performance on dense captioning tasks, e.g., 15.44% mAP on VG V1.2 and 13.98% mAP on the VG-COCO dataset.
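To make the two-head design concrete, the following is a minimal, hypothetical PyTorch sketch of the unification described above, not the paper's actual architecture: a shared region feature is projected for open-vocabulary classification against category-name embeddings, and the same feature conditions a small decoder that generates a region-grounded caption. All module names, dimensions, and the toy usage at the end are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CapDetSketch(nn.Module):
    """Illustrative two-head design (hypothetical): a shared region feature
    feeds both an open-world detection head (region-text alignment) and a
    dense captioning head (region-grounded caption generation)."""

    def __init__(self, region_dim=256, text_dim=256, vocab_size=30522):
        super().__init__()
        # Alignment head: projects region features into the text embedding
        # space so they can be matched against category-name embeddings.
        self.align_proj = nn.Linear(region_dim, text_dim)
        # Dense captioning head: a small autoregressive decoder that
        # generates a caption conditioned on each region feature.
        layer = nn.TransformerDecoderLayer(d_model=text_dim, nhead=8,
                                           batch_first=True)
        self.caption_decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.word_embed = nn.Embedding(vocab_size, text_dim)
        self.lm_head = nn.Linear(text_dim, vocab_size)

    def classify(self, region_feats, category_embeds):
        # Prediction under a given category list: similarity between
        # projected region features and category-name embeddings.
        regions = self.align_proj(region_feats)          # (R, D)
        return regions @ category_embeds.t()             # (R, C) logits

    def caption(self, region_feats, caption_tokens):
        # Region-grounded captioning: decode caption tokens with the
        # region feature as a one-element memory sequence.
        memory = self.align_proj(region_feats).unsqueeze(1)  # (R, 1, D)
        tgt = self.word_embed(caption_tokens)                # (R, T, D)
        hidden = self.caption_decoder(tgt, memory)           # (R, T, D)
        return self.lm_head(hidden)                          # (R, T, V)


# Toy usage: 4 candidate regions, 10 category names, captions of length 6.
model = CapDetSketch()
regions = torch.randn(4, 256)
cat_embeds = torch.randn(10, 256)
tokens = torch.randint(0, 30522, (4, 6))
print(model.classify(regions, cat_embeds).shape)  # torch.Size([4, 10])
print(model.caption(regions, tokens).shape)       # torch.Size([4, 6, 30522])
```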