Vision-language modeling has enabled open-vocabulary tasks where predictions can be queried using any text prompt in a zero-shot manner. Existing open-vocabulary tasks focus on object classes, whereas research on object attributes is limited due to the lack of a reliable attribute-focused evaluation benchmark. This paper introduces the Open-Vocabulary Attribute Detection (OVAD) task and the corresponding OVAD benchmark. The objective of the novel task and benchmark is to probe object-level attribute information learned by vision-language models. To this end, we created a clean and densely annotated test set covering 117 attribute classes on the 80 object classes of MS COCO. It includes positive and negative annotations, which enables open-vocabulary evaluation. Overall, the benchmark consists of 1.4 million annotations. For reference, we provide a first baseline method for open-vocabulary attribute detection. Moreover, we demonstrate the benchmark's value by studying the attribute detection performance of several foundation models. Project page: https://ovad-benchmark.github.io/