We aim at advancing open-vocabulary object detection, which detects objects described by arbitrary text inputs. The fundamental challenge is the availability of training data. It is costly to further scale up the number of classes contained in existing object detection datasets. To overcome this challenge, we propose ViLD, a training method via Vision and Language knowledge Distillation. Our method distills the knowledge from a pretrained open-vocabulary image classification model (teacher) into a two-stage detector (student). Specifically, we use the teacher model to encode category texts and image regions of object proposals. Then we train a student detector, whose region embeddings of detected boxes are aligned with the text and image embeddings inferred by the teacher. We benchmark on LVIS by holding out all rare categories as novel categories that are not seen during training. ViLD obtains 16.1 mask AP$_r$ with a ResNet-50 backbone, even outperforming the supervised counterpart by 3.8. When trained with a stronger teacher model ALIGN, ViLD achieves 26.3 AP$_r$. The model can directly transfer to other datasets without finetuning, achieving 72.2 AP$_{50}$ on PASCAL VOC, 36.6 AP on COCO and 11.8 AP on Objects365. On COCO, ViLD outperforms the previous state-of-the-art by 4.8 on novel AP and 11.4 on overall AP. Code and demo are open-sourced at https://github.com/tensorflow/tpu/tree/master/models/official/detection/projects/vild.
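To make the distillation objective described above concrete, below is a minimal sketch of the two losses the abstract refers to: aligning region embeddings with text embeddings of category names, and distilling the teacher's image embeddings of cropped proposals into the same region embeddings. This is not the paper's implementation; the `vild_losses` helper, tensor shapes, the temperature value, and the simplified background handling are illustrative assumptions.

```python
# Hypothetical PyTorch sketch of the two ViLD objectives; shapes and names are assumptions.
import torch
import torch.nn.functional as F


def vild_losses(region_emb, text_emb, labels, teacher_image_emb, temperature=0.01):
    """
    region_emb:        (N, D) region embeddings from the detector head (student).
    text_emb:          (C, D) teacher text embeddings of base-category names,
                       e.g. encoded from prompts such as "a photo of a {category}".
    labels:            (N,)   ground-truth class index in [0, C) per proposal
                       (the paper also uses a learned background embedding,
                       omitted here for brevity).
    teacher_image_emb: (M, D) teacher image embeddings of cropped proposal regions,
                       assumed to correspond to the first M rows of region_emb.
    """
    # Text alignment: classify regions by cosine similarity to category text embeddings.
    region_emb = F.normalize(region_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = region_emb @ text_emb.t() / temperature
    loss_text = F.cross_entropy(logits, labels)

    # Image distillation: L1 loss between the student's region embeddings and the
    # teacher's image embeddings of the same cropped proposals.
    loss_image = F.l1_loss(
        region_emb[: teacher_image_emb.shape[0]],
        F.normalize(teacher_image_emb, dim=-1),
    )
    return loss_text, loss_image
```

At inference time, the same similarity computation can be run against text embeddings of arbitrary (including novel) category names, which is what makes the detector open-vocabulary.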