This paper presents DetCLIPv2, an efficient and scalable training framework that incorporates large-scale image-text pairs to achieve open-vocabulary object detection (OVD). Unlike previous OVD frameworks that typically rely on a pre-trained vision-language model (e.g., CLIP) or exploit image-text pairs via a pseudo labeling process, DetCLIPv2 directly learns the fine-grained word-region alignment from massive image-text pairs in an end-to-end manner. To accomplish this, we employ a maximum word-region similarity between region proposals and textual words to guide the contrastive objective. To enable the model to gain localization capability while learning broad concepts, DetCLIPv2 is trained with a hybrid supervision from detection, grounding and image-text pair data under a unified data formulation. By jointly training with an alternating scheme and adopting low-resolution input for image-text pairs, DetCLIPv2 exploits image-text pair data efficiently and effectively: DetCLIPv2 utilizes 13X more image-text pairs than DetCLIP with a similar training time and improves performance. With 13M image-text pairs for pre-training, DetCLIPv2 demonstrates superior open-vocabulary detection performance, e.g., DetCLIPv2 with Swin-T backbone achieves 40.4% zero-shot AP on the LVIS benchmark, which outperforms previous works GLIP/GLIPv2/DetCLIP by 14.4/11.4/4.5% AP, respectively, and even beats its fully-supervised counterpart by a large margin.
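To make the core idea concrete, the following PyTorch snippet is a minimal sketch of a contrastive objective guided by maximum word-region similarity, not the paper's released implementation. It assumes per-image region-proposal embeddings and per-caption word embeddings are already available; each (image, text) pair is scored by averaging, over words, the maximum word-region cosine similarity, and a symmetric InfoNCE loss is applied over the batch. The function name, the temperature value, and the padding mask are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def word_region_contrastive_loss(region_feats, word_feats, word_mask, temperature=0.07):
    """Sketch of a max word-region similarity contrastive loss (hypothetical helper).

    region_feats: (B, R, D) region-proposal embeddings per image
    word_feats:   (B, W, D) word embeddings of the paired caption
    word_mask:    (B, W)    1 for real words, 0 for padding
    """
    region_feats = F.normalize(region_feats, dim=-1)
    word_feats = F.normalize(word_feats, dim=-1)

    # Word-region similarities for every (text_i, image_j) pair: (B, B, W, R)
    sim = torch.einsum('iwd,jrd->ijwr', word_feats, region_feats)

    # For each word, keep its best-matching region, then average over valid words.
    max_sim = sim.max(dim=-1).values                                   # (B, B, W)
    mask = word_mask.unsqueeze(1).float()                              # (B, 1, W)
    score = (max_sim * mask).sum(-1) / mask.sum(-1).clamp(min=1)       # (B, B)

    # Symmetric InfoNCE over the batch: matched pairs lie on the diagonal.
    logits = score / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

In the hybrid-supervision setting described above, a loss of this form would apply only to the image-text pair branch (fed at low resolution), while detection and grounding data retain their box-level supervision under the unified data formulation.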