Fully supervised semantic segmentation learns from dense masks, which incurs heavy annotation costs for a closed set of categories. In this paper, we use natural language as supervision, without any pixel-level annotation, for open-world segmentation. We call the proposed framework FreeSeg, where the mask is freely available from the raw feature maps of a pretrained model. Compared with zero-shot or open-set segmentation, FreeSeg requires no annotated masks, and unlike class-agnostic unsupervised segmentation, it predicts a wide range of category labels. Specifically, FreeSeg obtains free masks from the Image-Text Similarity Map (ITSM) of Interpretable Contrastive Language-Image Pretraining (ICLIP). Our core improvements are smoothed min pooling for dense ICLIP, together with partial-label and partial-pixel strategies for segmentation. Furthermore, FreeSeg is very straightforward, with no complex designs such as grouping, clustering, or retrieval. Beyond its simplicity, FreeSeg surpasses the previous state of the art by large margins, e.g., 13.4% higher mIoU on the VOC dataset under the same settings.
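The core mechanism above can be illustrated with a minimal sketch: compute a per-class similarity map between dense patch features and text embeddings, then take a per-pixel argmax to obtain a "free" mask. All shapes, the `tau` temperature, and the softmin reading of "smoothed min pooling" are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def itsm(patch_feats, text_emb):
    # Image-Text Similarity Map: cosine similarity between each dense
    # patch feature and one class text embedding.
    # Hypothetical shapes: patch_feats [H, W, D], text_emb [D].
    p = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    return p @ t  # [H, W] similarity map

def smoothed_min_pool(sim_map, tau=0.07):
    # One plausible reading of "smoothed min pooling": a softmin, i.e. a
    # temperature-weighted average dominated by the lowest similarities.
    # This is an assumption, not the paper's verified formula.
    w = np.exp(-sim_map / tau)
    w /= w.sum()
    return float((w * sim_map).sum())

def free_mask(patch_feats, text_embs):
    # Per-pixel argmax over the per-class similarity maps yields a
    # segmentation mask with no mask annotation at all.
    maps = np.stack([itsm(patch_feats, t) for t in text_embs], axis=-1)
    return maps.argmax(axis=-1)  # [H, W] class indices

# Toy usage with random stand-ins for ICLIP features and text embeddings.
rng = np.random.default_rng(0)
feats = rng.normal(size=(14, 14, 512))   # e.g. a 14x14 patch grid
texts = rng.normal(size=(3, 512))        # e.g. 3 class prompts
mask = free_mask(feats, texts)
print(mask.shape)  # (14, 14)
```

In a real pipeline the random arrays would be replaced by dense ICLIP patch features and encoded class prompts; the softmin aggregation is only used at training/pooling time, while the mask itself comes directly from the argmax over similarity maps.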