Despite great progress in object detection, most existing methods work only on a limited set of object categories, due to the tremendous human effort needed for bounding-box annotations of training data. To alleviate this problem, recent open vocabulary and zero-shot detection methods attempt to detect novel object categories beyond those seen during training. They achieve this goal by training on a pre-defined set of base categories to induce generalization to novel objects. However, their potential is still constrained by the small set of base categories available for training. To enlarge the set of base classes, we propose a method to automatically generate pseudo bounding-box annotations of diverse objects from large-scale image-caption pairs. Our method leverages the localization ability of pre-trained vision-language models to generate pseudo bounding-box labels and then directly uses them for training object detectors. Experimental results show that our method outperforms the state-of-the-art open vocabulary detector by 8% AP on COCO novel categories, by 6.3% AP on PASCAL VOC, by 2.3% AP on Objects365, and by 2.8% AP on LVIS. Code is available at https://github.com/salesforce/PB-OVD.
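For intuition, the sketch below shows one plausible way an object-level activation map from a pre-trained vision-language model (e.g., Grad-CAM over its cross-attention for a caption word) could be converted into a pseudo bounding box. This is a minimal illustration, not the authors' implementation: the function name, the activation-map source, and the threshold ratio are all assumptions made for the example.

```python
import numpy as np

def activation_to_pseudo_box(act_map: np.ndarray, thresh_ratio: float = 0.5):
    """Convert a per-pixel activation map into one pseudo bounding box.

    `act_map` is assumed to be an HxW saliency map for a single object
    word mentioned in a caption, e.g. obtained via Grad-CAM on a
    pre-trained vision-language model; `thresh_ratio` is an illustrative
    hyperparameter, not a value taken from the paper.
    """
    # Keep pixels whose activation exceeds a fraction of the map's maximum.
    mask = act_map >= thresh_ratio * act_map.max()
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None  # nothing localized for this word
    # Tightest axis-aligned box around the activated region: (x1, y1, x2, y2).
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

# Toy usage: a synthetic activation peak covering rows 20-30, columns 40-55.
demo = np.zeros((64, 64), dtype=np.float32)
demo[20:31, 40:56] = 1.0
print(activation_to_pseudo_box(demo))  # -> (40, 20, 55, 30)
```

Boxes produced this way for each object word in a caption could then serve directly as pseudo ground-truth annotations for training a standard detector, as the abstract describes.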