In this work, we propose an open-vocabulary object detection method that learns from image-caption pairs to detect novel object classes alongside a given set of known classes. Training proceeds in two stages: the first uses a location-guided image-caption matching technique to learn class labels for both novel and known classes in a weakly supervised manner, and the second specializes the model for the object detection task using known-class annotations. We show that a simple language model is better suited to detecting novel objects than a large contextualized language model. Moreover, we introduce a consistency-regularization technique to better exploit the information in image-caption pairs. Our method compares favorably to existing open-vocabulary detection approaches while being data-efficient. Source code is available at https://github.com/lmb-freiburg/locov.
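To make the first-stage objective concrete, the following is a minimal sketch of one common way to formulate image-caption matching over region and word embeddings. It is an illustration under stated assumptions and is not taken from the released code: the attention-weighted score, the symmetric contrastive loss, and the names `pair_score` and `matching_loss` are all hypothetical, and the location guidance and consistency regularization are omitted.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def log_softmax(xs):
    """Numerically stable log-softmax over a list of floats."""
    m = max(xs)
    log_z = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_z for x in xs]


def pair_score(regions, words):
    """Score one image (list of region vectors) against one caption
    (list of word vectors). Assumed form: each word attends over the
    regions, and the attended similarities are averaged over words."""
    total = 0.0
    for w in words:
        sims = [dot(r, w) for r in regions]
        attn = [math.exp(a) for a in log_softmax(sims)]  # weights over regions
        total += sum(a * s for a, s in zip(attn, sims))
    return total / len(words)


def matching_loss(images, captions):
    """Symmetric contrastive loss over a batch in which images[i] and
    captions[i] form the matching pair (an assumption of this sketch)."""
    n = len(images)
    scores = [[pair_score(img, cap) for cap in captions] for img in images]
    loss = 0.0
    for i in range(n):
        row = scores[i]                          # image i against all captions
        col = [scores[k][i] for k in range(n)]   # caption i against all images
        loss -= log_softmax(row)[i] + log_softmax(col)[i]
    return loss / (2 * n)
```

In this sketch, a batch whose captions align with their own images yields a lower loss than the same batch with captions shuffled, which is the signal that would let region features be associated with words for both known and novel classes.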