Recent open-vocabulary detection methods aim to detect novel objects by distilling knowledge from vision-language models (VLMs) trained on vast numbers of image-text pairs. To improve the effectiveness of these methods, researchers have leveraged datasets with large vocabularies spanning many object classes, under the assumption that such data enable models to capture comprehensive knowledge of the relationships among diverse objects and to generalize better to unseen classes. In this study, we argue that extracting richer knowledge about novel objects, including their attributes and relationships in addition to their names, requires more fine-grained labels. To address this challenge, we propose a simple and effective method named Pseudo Caption Labeling (PCL), which utilizes an image captioning model to generate captions describing object instances from diverse perspectives. The resulting pseudo caption labels provide dense samples for knowledge distillation. On the LVIS benchmark, our best model, trained on the de-duplicated VisualGenome dataset, achieves an AP of 34.5 and an APr of 30.6, comparable to the state of the art. PCL is also notably simple and flexible: it is a straightforward pre-processing technique that can be paired with any image captioning model, imposing no restrictions on model architecture or training procedure.