Despite great progress in object detection, most existing methods are limited to a small set of object categories due to the tremendous human effort needed for instance-level bounding-box annotation. To alleviate this problem, recent open-vocabulary and zero-shot detection methods attempt to detect object categories not seen during training. However, these approaches still rely on manually provided bounding-box annotations for a set of base classes. We propose an open-vocabulary detection framework that can be trained without manually provided bounding-box annotations. Our method achieves this by leveraging the localization ability of pre-trained vision-language models to generate pseudo bounding-box labels that can be used directly for training object detectors. Experimental results on COCO, PASCAL VOC, Objects365, and LVIS demonstrate the effectiveness of our method. Specifically, our method outperforms the state of the art (SOTA) trained with human-annotated bounding boxes by 3% AP on COCO novel categories, even though our training data includes no manual bounding-box labels. When utilizing manual bounding-box labels as our baselines do, our method surpasses the SOTA by a large margin of 8% AP.