Open-vocabulary object detection, the problem of detecting novel objects guided by natural language, has gained increasing attention from the community. Ideally, an open-vocabulary detector would produce bounding box predictions from user input given either as natural language or as an exemplar image. Supporting both input types makes human-computer interaction considerably more flexible. To this end, we propose a novel open-vocabulary detector based on DETR -- hence the name OV-DETR -- which, once trained, can detect any object given its class name or an exemplar image. The biggest challenge in turning DETR into an open-vocabulary detector is that the classification cost matrix for novel classes cannot be computed without access to their labeled images. To overcome this challenge, we formulate the learning objective as binary matching between input queries (a class name or an exemplar image) and the corresponding objects, so that the model learns correspondences that generalize to unseen queries at test time. For training, we condition the Transformer decoder on input embeddings obtained from a pre-trained vision-language model such as CLIP, which enables matching for both text and image queries. With extensive experiments on the LVIS and COCO datasets, we demonstrate that OV-DETR -- the first end-to-end Transformer-based open-vocabulary detector -- achieves non-trivial improvements over the current state of the art.
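The binary matching objective can be illustrated with a small sketch. This is not the OV-DETR implementation: the function names, the cost weights, and the brute-force search (standing in for the Hungarian algorithm that DETR-style detectors typically use) are all assumptions made for illustration. The idea shown is that, for one conditioning query, each prediction carries a single "matches the query" score rather than per-class scores, so the matching cost needs no class-specific term.

```python
from itertools import permutations
from math import exp, inf


def sigmoid(x):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + exp(-x))


def l1_box_distance(box_a, box_b):
    """L1 distance between two boxes given as (cx, cy, w, h)."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def binary_matching(match_logits, pred_boxes, target_boxes,
                    w_match=1.0, w_box=5.0):
    """Assign each target box to a distinct prediction for ONE query.

    match_logits[i] is the (hypothetical) score that prediction i
    matches the conditioning query (a class name or exemplar image).
    The weights w_match and w_box are illustrative assumptions.
    Returns a list where entry j is the prediction index for target j.
    """
    n_pred, n_tgt = len(pred_boxes), len(target_boxes)
    if n_tgt > n_pred:
        raise ValueError("need at least as many predictions as targets")

    # cost[i][j]: cost of assigning prediction i to target j.
    # Only a binary match term and a box term; no per-class term.
    cost = [
        [-w_match * sigmoid(match_logits[i])
         + w_box * l1_box_distance(pred_boxes[i], tgt)
         for tgt in target_boxes]
        for i in range(n_pred)
    ]

    # Brute-force search over assignments; fine for tiny examples only.
    best_total, best_assignment = inf, ()
    for perm in permutations(range(n_pred), n_tgt):
        total = sum(cost[p][j] for j, p in enumerate(perm))
        if total < best_total:
            best_total, best_assignment = total, perm
    return list(best_assignment)


if __name__ == "__main__":
    logits = [2.0, -1.0, 0.5]
    preds = [[0.5, 0.5, 0.2, 0.2], [0.1, 0.1, 0.1, 0.1], [0.8, 0.8, 0.3, 0.3]]
    targets = [[0.8, 0.8, 0.3, 0.3], [0.5, 0.5, 0.2, 0.2]]
    print(binary_matching(logits, preds, targets))
```

Because the cost depends only on whether a prediction matches the query, the same procedure applies unchanged to queries for classes that were never seen during training, which is the property the abstract relies on.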