Despite the remarkable accuracy of deep neural networks in object detection, they are costly to train and scale because of their supervision requirements. In particular, learning more object categories typically requires proportionally more bounding box annotations. Weakly supervised and zero-shot learning techniques have been explored to scale object detectors to more categories with less supervision, but they have not been as successful or as widely adopted as supervised models. In this paper, we put forth a novel formulation of the object detection problem, namely open-vocabulary object detection, which is more general, more practical, and more effective than weakly supervised and zero-shot approaches. We propose a new method to train object detectors using bounding box annotations for a limited set of object categories, as well as image-caption pairs that cover a larger variety of objects at a significantly lower cost. We show that the proposed method can detect and localize objects for which no bounding box annotation is provided during training, with significantly higher accuracy than zero-shot approaches. Meanwhile, objects with bounding box annotations can be detected almost as accurately as with supervised methods, which is significantly better than weakly supervised baselines. Accordingly, we establish a new state of the art for scalable object detection.