Open-vocabulary semantic segmentation (OVSS) is fundamentally hampered by the coarse, image-level representations of CLIP, which lack precise pixel-level detail. Existing training-free methods attempt to resolve this either by importing priors from costly external foundation models (e.g., SAM, DINO) or by applying static, hand-crafted heuristics to CLIP's internal features; the former is computationally expensive and the latter sub-optimal. We propose the Attention Refinement Module (ARM), a lightweight, learnable module that effectively unlocks and refines CLIP's internal potential. Unlike static-fusion methods, ARM learns to adaptively fuse hierarchical features. It employs a semantically guided cross-attention block, in which robust deep features (K, V) select and refine detail-rich shallow features (Q), followed by a self-attention block. The key innovation lies in a ``train once, use anywhere'' paradigm: trained once on a general-purpose dataset (e.g., COCO-Stuff), ARM acts as a universal plug-and-play post-processor for diverse training-free frameworks. Extensive experiments show that ARM consistently boosts baseline performance on multiple benchmarks with negligible inference overhead, establishing an efficient and effective paradigm for training-free OVSS.
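A minimal sketch of the module structure described above: shallow CLIP patch tokens serve as queries, deep tokens as keys/values in a cross-attention block, followed by a self-attention block. All dimensions, layer choices, residual connections, and names here are illustrative assumptions, not the exact implementation.

```python
# Illustrative PyTorch sketch of an ARM-style refinement block (assumed design).
import torch
import torch.nn as nn


class AttentionRefinementModule(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Cross-attention: detail-rich shallow features (Q) attend to
        # semantically robust deep features (K, V).
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Self-attention refines the fused representation.
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, shallow: torch.Tensor, deep: torch.Tensor) -> torch.Tensor:
        # shallow, deep: (batch, num_patches, dim) tokens from different CLIP layers.
        fused, _ = self.cross_attn(query=shallow, key=deep, value=deep)
        fused = self.norm1(shallow + fused)            # residual connection (assumed)
        refined, _ = self.self_attn(fused, fused, fused)
        return self.norm2(fused + refined)


if __name__ == "__main__":
    arm = AttentionRefinementModule(dim=512)
    shallow = torch.randn(1, 196, 512)   # e.g., tokens from an early ViT block
    deep = torch.randn(1, 196, 512)      # e.g., tokens from the final ViT block
    print(arm(shallow, deep).shape)      # torch.Size([1, 196, 512])
```

In the ``train once, use anywhere'' setting, a module like this would be trained once (e.g., on COCO-Stuff) and then attached unchanged as a post-processor to different training-free OVSS pipelines at inference time.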