Advancing object detection to open-vocabulary and few-shot transfer has long been a challenge for computer vision research. This work explores a continual learning approach that enables a detector to expand its zero/few-shot capabilities via multi-dataset vision-language pre-training. Using natural language as a knowledge representation, we explore methods to accumulate a "visual vocabulary" from different training datasets and unify the task under a language-conditioned detection framework. Specifically, we propose a novel language-aware detector, OmDet, together with a novel training mechanism. The proposed multimodal detection network resolves the technical challenges of multi-dataset joint training and generalizes to an arbitrary number of training datasets without requiring manual label taxonomy merging. Experimental results on COCO, Pascal VOC, and WIDER Face/Pedestrian confirm the efficacy of the approach: joint training achieves scores on par with or higher than training on each dataset separately. Moreover, we pre-train on more than 20 million images covering 4 million unique object vocabulary entries, and evaluate the resulting model on 35 downstream tasks of ODinW. Results show that OmDet achieves state-of-the-art fine-tuned performance on ODinW, and our analysis shows that, as the proposed pre-training method is scaled up, OmDet continues to improve its zero/few-shot tuning performance, suggesting a promising direction for further scaling.
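To make the core idea concrete, the following is a minimal conceptual sketch (not the paper's actual implementation; all names are illustrative) of a language-conditioned detection interface: because class labels are plain strings, vocabularies from different datasets can be accumulated by simple set union, with no manual taxonomy merging, and the detector is queried at inference time with an arbitrary list of label prompts.

```python
# Conceptual sketch of language-conditioned detection (illustrative only;
# a real detector such as OmDet would score (box, label) pairs by
# image-text similarity inside a multimodal network).

def merge_vocabularies(*dataset_label_sets):
    """Accumulate a shared 'visual vocabulary' by set union over the
    label strings of multiple datasets. Since labels are free-form text,
    no manual label taxonomy alignment is required."""
    vocab = set()
    for labels in dataset_label_sets:
        vocab |= {label.strip().lower() for label in labels}
    return sorted(vocab)

def detect(image, prompt_labels):
    """Language-conditioned detection: the class set is supplied at query
    time as a list of label strings rather than fixed class indices.
    Placeholder body: returns an empty detection list per queried label."""
    return {label: [] for label in prompt_labels}

# Merging a COCO-style label set with a face/pedestrian-style label set:
vocab = merge_vocabularies(["person", "car", "dog"], ["Face", "person"])
# vocab == ['car', 'dog', 'face', 'person']
```

The key design point is that the union in `merge_vocabularies` is lossless and order-independent, so the framework extends to any number of training datasets without changes.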