The goal of this work is to establish a scalable pipeline for expanding an object detector towards novel/unseen categories, using zero manual annotations. To achieve that, we make the following four contributions: (i) in pursuit of generalisation, we propose a two-stage open-vocabulary object detector, where class-agnostic object proposals are classified with the text encoder of a pre-trained visual-language model; (ii) to pair the visual latent space (of RPN box proposals) with that of the pre-trained text encoder, we propose the idea of regional prompt learning to align the textual embedding space with regional visual object features; (iii) to scale up the learning procedure towards detecting a wider spectrum of objects, we exploit available online resources via a novel self-training framework, which allows training the proposed detector on a large corpus of noisy, uncurated web images. Lastly, (iv) to evaluate our proposed detector, termed PromptDet, we conduct extensive experiments on the challenging LVIS and MS-COCO datasets. PromptDet shows superior performance over existing approaches with fewer additional training images and zero manual annotations whatsoever. Project page with code: https://fcjian.github.io/promptdet.
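To make contributions (i) and (ii) concrete, the following is a minimal sketch (not the authors' released code) of open-vocabulary proposal classification with regional prompt learning: pooled features of class-agnostic RPN proposals are scored by cosine similarity against text embeddings of category names, where a small set of learnable context vectors is prepended to each class-name token sequence so the textual space aligns with regional visual features. The interfaces `text_encoder`, `token_embedding`, `num_prompt_tokens`, and the embedding dimension are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionalPromptClassifier(nn.Module):
    """Sketch of classifying region features against prompt-conditioned text embeddings."""

    def __init__(self, text_encoder, token_embedding, class_token_ids,
                 num_prompt_tokens=8, embed_dim=512):
        super().__init__()
        self.text_encoder = text_encoder          # frozen CLIP-style text transformer (assumed interface)
        self.token_embedding = token_embedding    # frozen token-embedding table of the text encoder
        self.class_token_ids = class_token_ids    # (num_classes, seq_len) tokenised category names
        # learnable context vectors shared across classes: the only parameters trained here
        self.prompt = nn.Parameter(torch.randn(num_prompt_tokens, embed_dim) * 0.02)
        self.logit_scale = nn.Parameter(torch.tensor(4.6))  # CLIP-style temperature, ~log(100)

    def class_embeddings(self):
        # embed class-name tokens and prepend the learnable prompt vectors
        name_emb = self.token_embedding(self.class_token_ids)            # (C, L, D)
        prompt = self.prompt.unsqueeze(0).expand(name_emb.size(0), -1, -1)
        tokens = torch.cat([prompt, name_emb], dim=1)                    # (C, P+L, D)
        text_feat = self.text_encoder(tokens)                            # (C, D), assumed to accept embeddings
        return F.normalize(text_feat, dim=-1)

    def forward(self, region_feats):
        # region_feats: (N, D) pooled features of class-agnostic RPN proposals
        region_feats = F.normalize(region_feats, dim=-1)
        class_emb = self.class_embeddings()
        # cosine-similarity logits over the (open) vocabulary of category names
        return self.logit_scale.exp() * region_feats @ class_emb.t()
```

Because only the prompt vectors (and temperature) are optimised while the text encoder stays frozen, novel categories can be added at inference time simply by tokenising their names, which is the property that makes the classifier open-vocabulary.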