We present F-VLM, a simple open-vocabulary object detection method built upon Frozen Vision and Language Models. F-VLM simplifies the current multi-stage training pipeline by eliminating the need for knowledge distillation or detection-tailored pretraining. Surprisingly, we observe that a frozen VLM: 1) retains the locality-sensitive features necessary for detection, and 2) is a strong region classifier. We finetune only the detector head and combine the detector and VLM outputs for each region at inference time. F-VLM shows compelling scaling behavior and achieves a +6.5 mask AP improvement over the previous state of the art on novel categories of the LVIS open-vocabulary detection benchmark. We also demonstrate very competitive results on the COCO open-vocabulary detection benchmark and in cross-dataset transfer detection, along with significant training speed-ups and compute savings. Code will be released at https://sites.google.com/view/f-vlm/home
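To make the inference-time combination concrete, below is a minimal sketch, not the paper's released code. It assumes the VLM region scores come from softmaxed cosine similarity between pooled region features and class-name text embeddings, and that the detector and VLM outputs are fused by a geometric mean with an assumed weight `alpha`; all function names, shapes, and values here are illustrative.

```python
import numpy as np

def vlm_region_scores(region_feats, text_embeds, temperature=0.01):
    """Score each region against open-vocabulary class names (assumed
    mechanism): cosine similarity between pooled frozen-VLM region
    features and class text embeddings, softmaxed over classes."""
    region_feats = region_feats / np.linalg.norm(region_feats, axis=-1, keepdims=True)
    text_embeds = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    logits = region_feats @ text_embeds.T / temperature
    logits -= logits.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

def fuse_scores(det_scores, vlm_scores, alpha=0.35):
    """Hypothetical fusion of per-region detector-head probabilities with
    VLM region classification scores via a geometric mean; the abstract
    only states that the two outputs are combined, and `alpha` is an
    assumed weight, not a value from the paper."""
    return det_scores ** (1.0 - alpha) * vlm_scores ** alpha

# Toy usage: 5 candidate regions, 512-d features, 3 candidate class names.
rng = np.random.default_rng(0)
region_feats = rng.normal(size=(5, 512))   # pooled frozen-VLM region features
text_embeds = rng.normal(size=(3, 512))    # class-name text embeddings
det_scores = rng.uniform(size=(5, 3))      # finetuned detector-head scores
final = fuse_scores(det_scores, vlm_region_scores(region_feats, text_embeds))
print(final.shape)  # (5, 3): one fused score per region per class
```

Because only the detector head is trained while the VLM backbone stays frozen, this fusion is the only place the open-vocabulary knowledge of the VLM enters the final prediction.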