Recent methods for detecting arbitrary objects in the real world are trained and evaluated on object detection datasets with relatively restricted vocabularies. To facilitate the development of more general visual object detection, we propose V3Det, a vast vocabulary visual detection dataset with precisely annotated bounding boxes on a massive number of images. V3Det has several appealing properties: 1) Vast Vocabulary: It contains bounding boxes of objects from 13,029 categories on real-world images, which is 10 times larger than existing large vocabulary object detection datasets, e.g., LVIS. 2) Hierarchical Category Organization: The vast vocabulary of V3Det is organized by a hierarchical category tree that annotates the inclusion relationships among categories, encouraging the exploration of category relationships in vast and open vocabulary object detection. 3) Rich Annotations: V3Det comprises precisely annotated objects in 245k images, along with professional descriptions of each category written by human experts and a powerful chatbot. By offering a vast exploration space, V3Det enables extensive benchmarks on both vast and open vocabulary object detection, leading to new observations, practices, and insights for future research. It has the potential to serve as a cornerstone dataset for developing more general visual perception systems.