Despite the success of deep features in content-based image retrieval, visual instance search remains challenging because effective instance-level feature representations are lacking. Supervised and weakly supervised object detection methods are not suitable solutions because they perform poorly on unknown object categories. In this paper, based on the feature set output by a self-supervised ViT, instance-level region discovery is modeled as detecting compact feature subsets in a hierarchical fashion. This hierarchical decomposition yields a hierarchy of instance regions. On the one hand, the decomposition addresses object embedding and occlusion, both of which are widely observed in real scenarios. On the other hand, the non-leaf and leaf nodes of the hierarchy correspond to instance regions of different granularities within an image. Features of uniform length are therefore produced for these instance regions, which may cover a dominant image region, a group of multiple instances, or individual instances. Such a collection of features allows image retrieval, multi-instance search, and instance search to be unified in one framework. Empirical studies on three benchmarks show that this instance-level descriptor remains effective on both known and unknown object categories. Moreover, superior performance is achieved on single-instance search, multi-instance search, and image retrieval.
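To make the idea of a hierarchy of compact feature subsets concrete, the following is a minimal sketch and not the procedure used in the paper. It assumes the ViT patch features are already available as an `(N, D)` array, and it substitutes a simple bisecting cosine 2-means for the paper's compact-subset detection. The names `build_hierarchy`, `two_means`, and `best_match`, and the parameters `min_size` and `max_depth`, are hypothetical and introduced only for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def two_means(feats, iters=10):
    """Split unit-norm rows into two groups with cosine 2-means.

    Stand-in for compact-subset detection (an assumption, not the
    paper's method). Seeds are chosen deterministically.
    """
    mean = l2_normalize(feats.mean(axis=0))
    a = np.argmin(feats @ mean)        # row least similar to the mean
    b = np.argmin(feats @ feats[a])    # row least similar to the first seed
    centers = np.stack([feats[a], feats[b]])
    labels = np.zeros(len(feats), dtype=int)
    for _ in range(iters):
        labels = np.argmax(feats @ centers.T, axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = l2_normalize(feats[labels == k].mean(axis=0))
    return labels


def build_hierarchy(patch_feats, min_size=4, max_depth=3):
    """Recursively bisect patch features; return a flat list of nodes.

    Every node (leaf or non-leaf) gets a descriptor of the same length D:
    the normalized mean of the patch features it covers.
    """
    feats = l2_normalize(np.asarray(patch_feats, dtype=float))
    nodes = []

    def recurse(idx, depth):
        node = {
            "patches": idx,
            "descriptor": l2_normalize(feats[idx].mean(axis=0)),
            "depth": depth,
            "children": [],
        }
        node_id = len(nodes)
        nodes.append(node)
        if depth < max_depth and len(idx) >= 2 * min_size:
            labels = two_means(feats[idx])
            left, right = idx[labels == 0], idx[labels == 1]
            # Only split when both parts are large enough to be a region.
            if min(len(left), len(right)) >= min_size:
                node["children"] = [recurse(left, depth + 1),
                                    recurse(right, depth + 1)]
        return node_id

    recurse(np.arange(len(feats)), 0)
    return nodes


def best_match(query, nodes):
    """Return (node index, cosine similarity) of the closest node."""
    q = l2_normalize(np.asarray(query, dtype=float))
    sims = np.array([n["descriptor"] @ q for n in nodes])
    return int(np.argmax(sims)), float(sims.max())
```

In this sketch the root node plays the role of the dominant image region, inner nodes stand for groups of instances, and leaves stand for individual instances. Because all descriptors share one length, a single nearest-neighbor lookup such as `best_match` can serve image-level and instance-level queries alike.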