Detecting Everything in the Open World: Towards Universal Object Detection

In this paper, we formally address universal object detection, which aims to detect objects in every scene and predict every category. The dependence on human annotations, limited visual information, and novel categories in the open world severely restrict the universality of traditional detectors. We propose UniDetector, a universal object detector that can recognize an enormous number of categories in the open world. The critical points for the universality of UniDetector are: 1) it leverages images from multiple sources and heterogeneous label spaces for training through the alignment of image and text spaces, which guarantees sufficient information for universal representations; 2) it generalizes to the open world easily while keeping the balance between seen and unseen classes, thanks to abundant information from both the vision and language modalities; 3) it further promotes generalization to novel categories through our proposed decoupled training scheme and probability calibration. These contributions allow UniDetector to detect over 7k categories, the largest measurable category size so far, with only about 500 classes participating in training. UniDetector shows strong zero-shot generalization on large-vocabulary datasets such as LVIS, ImageNetBoxes, and VisualGenome: it surpasses the traditional supervised baselines by more than 4% on average without seeing any corresponding images. On 13 public detection datasets with various scenes, UniDetector also achieves state-of-the-art performance with only 3% of the training data.

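The abstract's first and third points (classifying through aligned image and text spaces, then calibrating probabilities) can be illustrated with a small sketch. This is not the UniDetector implementation: the random embeddings stand in for a real detector and text encoder, and the function names, the `temperature` and `gamma` values, the sigmoid scoring, and the `p / prior**gamma` calibration form are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def classify_regions(region_emb, text_emb, temperature=0.1):
    """Score each region against each category-name embedding.

    region_emb: (num_regions, dim) visual features (hypothetical detector output).
    text_emb:   (num_classes, dim) category-name embeddings (hypothetical text encoder output).
    Returns sigmoid probabilities of shape (num_regions, num_classes).
    """
    sims = l2_normalize(region_emb) @ l2_normalize(text_emb).T
    return 1.0 / (1.0 + np.exp(-sims / temperature))


def calibrate(probs, prior, gamma=0.6):
    """Down-weight classes with a high prior (assumed form: p / prior**gamma)."""
    return probs / np.power(prior, gamma)


rng = np.random.default_rng(0)
dim = 8
# Stand-ins for the embeddings of three category names.
text_emb = rng.normal(size=(3, dim))
# Two regions whose features lie close to classes 2 and 0 respectively.
region_emb = text_emb[[2, 0]] + 0.01 * rng.normal(size=(2, dim))

probs = classify_regions(region_emb, text_emb)
# Assumed prior: the mean predicted probability of each class over all regions.
prior = probs.mean(axis=0)
calibrated = calibrate(probs, prior)

print(probs.argmax(axis=1))
print(calibrated.shape)
```

Because the classifier is just a similarity against text embeddings, the label space can be changed at inference time by passing a different `text_emb` matrix, which is what makes training on heterogeneous label spaces and testing on unseen categories possible in this kind of design.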