A Training-Free Framework for Open-Vocabulary Image Segmentation and Recognition with EfficientNet and CLIP

This paper presents a novel training-free framework for open-vocabulary image segmentation and object recognition (OVSR), which leverages EfficientNetB0, a convolutional neural network, for unsupervised segmentation and CLIP, a vision-language model, for open-vocabulary object recognition. The proposed framework adopts a two-stage pipeline: unsupervised image segmentation followed by segment-level recognition via vision-language alignment. In the first stage, pixel-wise features extracted from EfficientNetB0 are decomposed using singular value decomposition (SVD) to obtain latent representations, which are then grouped by hierarchical clustering into semantically meaningful regions. The number of clusters is adaptively determined by the distribution of singular values. In the second stage, the segmented regions are localized and encoded into image embeddings using the Vision Transformer backbone of CLIP. Text embeddings are precomputed using CLIP's text encoder from category-specific prompts, including a generic "something else" prompt to support open-set recognition. The image and text embeddings are concatenated and projected into a shared latent feature space via SVD to enhance cross-modal alignment. Recognition is performed by computing the softmax over the similarities between the projected image and text embeddings. The proposed method is evaluated on standard benchmarks, including COCO, ADE20K, and PASCAL VOC, achieving state-of-the-art performance in terms of Hungarian mIoU, precision, recall, and F1-score. These results demonstrate the effectiveness, flexibility, and generalizability of the proposed framework.
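
The two stages described above can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions: the feature map and the embeddings are plain arrays standing in for EfficientNetB0 and CLIP outputs; the cluster count is taken as one more than the number of singular values needed to reach a 90% energy threshold (the abstract does not state the actual rule); Ward linkage, cosine similarity, and the `temperature` value are likewise illustrative choices. The function names `choose_rank`, `segment_features`, and `recognize` are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def choose_rank(singular_values, energy=0.9):
    """Smallest k whose leading singular values hold `energy` of the
    squared spectrum. ASSUMPTION: the abstract gives no exact rule."""
    s2 = np.asarray(singular_values, dtype=float) ** 2
    cumulative = np.cumsum(s2) / s2.sum()
    k = int(np.searchsorted(cumulative, energy)) + 1
    return min(k, len(s2))


def segment_features(features, energy=0.9):
    """Stage 1 sketch: (H, W, C) pixel-wise features -> (H, W) label map.

    `features` stands in for an EfficientNetB0 feature map.
    """
    h, w, c = features.shape
    x = features.reshape(h * w, c)
    x = x - x.mean(axis=0, keepdims=True)  # center before SVD
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    k = choose_rank(s, energy)
    latent = u[:, :k] * s[:k]              # latent pixel representations
    n_clusters = k + 1                     # ASSUMPTION: heuristic cluster count
    tree = linkage(latent, method="ward")  # agglomerative (hierarchical) clustering
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return labels.reshape(h, w), n_clusters


def recognize(image_emb, text_emb, rank=None, temperature=0.01):
    """Stage 2 sketch: (N, D) image and (M, D) text embeddings -> (N, M) probabilities.

    The embeddings stand in for CLIP outputs; the last text row can play
    the role of the generic "something else" prompt.
    """
    n = image_emb.shape[0]
    joint = np.concatenate([image_emb, text_emb], axis=0)  # stack both modalities
    _, s, vt = np.linalg.svd(joint, full_matrices=False)
    r = len(s) if rank is None else rank
    proj = joint @ vt[:r].T                                # shared latent space
    proj = proj / (np.linalg.norm(proj, axis=1, keepdims=True) + 1e-12)
    logits = proj[:n] @ proj[n:].T / temperature           # scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)    # numerically stable softmax
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)
```

Calling `segment_features` on a feature map returns the label map together with the chosen cluster count, and `recognize` returns one probability row per segment; passing a smaller `rank` truncates the shared space, which is where any denoising effect of the SVD projection would come from.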
