Existing machine learning models demonstrate excellent performance in image object recognition after training on a large-scale dataset under full supervision. However, these models only learn to map an image to a predefined class index, without revealing the actual semantic meaning of the object in the image. In contrast, vision-language models like CLIP are able to assign semantic class names to unseen objects in a `zero-shot' manner, although they still rely on a predefined set of candidate names at test time. In this paper, we reconsider the recognition problem and task a vision-language model with assigning class names to images given only a large and essentially unconstrained vocabulary of categories as prior information. We use non-parametric methods to establish relationships between images, which allow the model to automatically narrow down the set of candidate names. Specifically, we propose iteratively clustering the data and voting on class names within each cluster, and show that this yields a roughly 50\% improvement over the baseline on ImageNet. Furthermore, we tackle this problem in both unsupervised and partially supervised settings, and with both coarse-grained and fine-grained search spaces as the unconstrained dictionary.
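The cluster-then-vote idea can be sketched as follows. This is a minimal illustration, assuming precomputed, L2-normalized image embeddings (e.g. from a model like CLIP) and a text embedding for every candidate name in the vocabulary; the spherical k-means initialization and the voting rule shown here are simplifications for exposition, not the paper's exact procedure.

```python
import numpy as np

def cluster_and_vote(image_feats, name_feats, n_clusters, n_iters=10):
    """Assign a vocabulary name to every image by clustering, then voting.

    image_feats: (N, d) L2-normalized image embeddings.
    name_feats:  (V, d) L2-normalized text embeddings, one per candidate name.
    Returns a vocabulary index for each of the N images.
    """
    # Deterministic farthest-point initialization for k-means:
    # start from image 0, then repeatedly add the image least similar
    # to all centers chosen so far.
    idx = [0]
    for _ in range(n_clusters - 1):
        sims = image_feats @ image_feats[idx].T          # (N, len(idx))
        idx.append(int(np.argmin(sims.max(axis=1))))
    centers = image_feats[idx].copy()

    # Spherical k-means: assign by cosine similarity, re-normalize centers.
    for _ in range(n_iters):
        assign = np.argmax(image_feats @ centers.T, axis=1)
        for k in range(n_clusters):
            members = image_feats[assign == k]
            if len(members):
                c = members.mean(axis=0)
                centers[k] = c / np.linalg.norm(c)

    # Voting: each cluster pools its members' similarities to every name
    # in the vocabulary and adopts the name with the highest total score.
    votes = np.zeros((n_clusters, len(name_feats)))
    for k in range(n_clusters):
        members = image_feats[assign == k]
        if len(members):
            votes[k] = (members @ name_feats.T).sum(axis=0)
    return votes.argmax(axis=1)[assign]  # one name index per image
```

Iterating this procedure — re-clustering after the vote has narrowed the plausible vocabulary — is the iterative refinement the abstract refers to.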