Removing out-of-distribution (OOD) images from noisy images scraped from the Internet is an important preprocessing step when constructing datasets; it can be addressed by zero-shot OOD detection with vision-language foundation models such as CLIP. However, the existing zero-shot OOD detection setting does not consider the realistic case where an image contains both in-distribution (ID) and OOD objects. Identifying such images as ID images is important when collecting images of rare classes, or of ethically inappropriate classes that must not be missed. In this paper, we propose a novel problem setting called in-distribution (ID) detection, in which images containing ID objects are identified as ID images even if they also contain OOD objects, while images lacking ID objects are identified as OOD images. To solve this problem, we present a new approach, \textbf{G}lobal-\textbf{L}ocal \textbf{M}aximum \textbf{C}oncept \textbf{M}atching (GL-MCM), based on both global and local visual-text alignments of CLIP features, which can identify any image containing ID objects as an ID image. Extensive experiments demonstrate that GL-MCM outperforms comparison methods on both multi-object datasets and single-object ImageNet benchmarks.
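The global-local scoring idea can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function name, feature shapes, and the simple "global score plus local score" combination are illustrative assumptions. It assumes a CLIP-style global image embedding, per-patch local embeddings, and one text embedding per ID class; each component takes the maximum class-wise softmax of scaled cosine similarities, so an image scores high if its global view or any local region aligns with some ID concept.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gl_mcm_score(global_feat, local_feats, text_feats, tau=1.0):
    """Illustrative global-local maximum concept matching score.

    global_feat: (D,) global image embedding
    local_feats: (P, D) per-patch image embeddings
    text_feats:  (K, D) text embeddings of the K ID classes
    tau: softmax temperature (assumed hyperparameter)
    Returns a scalar; higher means more likely to contain an ID object.
    """
    # L2-normalize so dot products are cosine similarities.
    g = global_feat / np.linalg.norm(global_feat)
    l = local_feats / np.linalg.norm(local_feats, axis=-1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)

    # Global term: max softmax probability over ID classes.
    s_global = softmax((g @ t.T) / tau).max()
    # Local term: softmax over classes per patch, then max over
    # all patches and classes, so one matching region suffices.
    s_local = softmax((l @ t.T) / tau, axis=-1).max()
    return s_global + s_local
```

In this sketch, an image whose global embedding is off-distribution but whose patches contain an ID concept still receives a high local term, which is what lets images mixing ID and OOD objects be kept as ID.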