We aim to advance open-vocabulary object detection, which detects objects described by arbitrary text inputs. The fundamental challenge is the availability of training data: existing object detection datasets contain only hundreds of categories, and scaling up further is costly. To overcome this challenge, we propose ViLD, a training method via Vision and Language knowledge Distillation. Our method distills the knowledge from a pretrained open-vocabulary image classification model (teacher) into a two-stage detector (student). Specifically, we use the teacher model to encode category texts and the image regions of object proposals. We then train a student detector whose region embeddings of detected boxes are aligned with the text and image embeddings inferred by the teacher. We benchmark on LVIS by holding out all rare categories as novel categories not seen during training. ViLD obtains 16.1 mask AP$_r$ with a ResNet-50 backbone, even outperforming the supervised counterpart by 3.8. The model can directly transfer to other datasets without finetuning, achieving 72.2 AP$_{50}$, 36.6 AP and 11.8 AP on PASCAL VOC, COCO and Objects365, respectively. On COCO, ViLD outperforms the previous state-of-the-art by 4.8 on novel AP and 11.4 on overall AP.
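To make the alignment described above concrete, the sketch below assumes a PyTorch-style setup; the function name, tensor shapes, and the fixed background vector are illustrative placeholders rather than the authors' implementation. It pairs a text-alignment loss (cross-entropy over temperature-scaled cosine similarities between student region embeddings and the teacher's category text embeddings) with an L1 distillation loss toward the teacher's image embeddings of cropped proposals.

```python
# Minimal sketch of the two alignment objectives described in the abstract.
# Random tensors stand in for real model outputs; this is not the authors' code.
import torch
import torch.nn.functional as F

def vild_losses(region_emb, text_emb, labels, teacher_image_emb, temperature=0.01):
    """
    region_emb:        (N, D) student region embeddings for N proposals
    text_emb:          (C, D) teacher text embeddings for C base categories
    labels:            (N,)   ground-truth class indices (index C = background)
    teacher_image_emb: (N, D) teacher image embeddings of the cropped proposals
    """
    region_emb = F.normalize(region_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Text alignment: classify each region against the text embeddings plus a
    # background embedding (learned in practice; a fixed placeholder here).
    background = F.normalize(
        torch.zeros(1, text_emb.shape[-1], device=text_emb.device) + 1e-6, dim=-1)
    logits = region_emb @ torch.cat([text_emb, background], dim=0).t() / temperature
    loss_text = F.cross_entropy(logits, labels)

    # Image distillation: L1 loss toward the teacher's image embeddings.
    loss_image = F.l1_loss(region_emb, F.normalize(teacher_image_emb, dim=-1))

    return loss_text, loss_image

# Toy usage with random tensors standing in for real model outputs.
N, C, D = 8, 5, 512
loss_text, loss_image = vild_losses(
    torch.randn(N, D), torch.randn(C, D),
    torch.randint(0, C + 1, (N,)), torch.randn(N, D))
```

In this reading, the text loss supervises proposals matched to base-category boxes, while the image loss requires no labels and can therefore carry the teacher's knowledge about categories never annotated in the detection data.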