Grouping and recognition are important components of visual scene understanding, e.g., for object detection and semantic segmentation. With end-to-end deep learning systems, grouping of image regions usually happens implicitly via top-down supervision from pixel-level recognition labels. Instead, in this paper, we propose to bring back the grouping mechanism into deep networks, which allows semantic segments to emerge automatically with only text supervision. We propose a hierarchical Grouping Vision Transformer (GroupViT), which goes beyond the regular grid structure and learns to group image regions into progressively larger arbitrary-shaped segments. We train GroupViT jointly with a text encoder on a large-scale image-text dataset via contrastive losses. With only text supervision and without any pixel-level annotations, GroupViT learns to group together semantic regions and successfully transfers to the task of semantic segmentation in a zero-shot manner, i.e., without any further fine-tuning. It achieves a zero-shot accuracy of 52.3\% mIoU on the PASCAL VOC 2012 dataset and 22.4\% mIoU on the PASCAL Context dataset, and performs competitively with state-of-the-art transfer-learning methods that require greater levels of supervision. We open-source our code at \href{https://github.com/NVlabs/GroupViT}{https://github.com/NVlabs/GroupViT}.
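As a rough illustration of the training objective summarized above, the following is a minimal sketch (in PyTorch; not the authors' released implementation) of a symmetric image-text contrastive loss of the kind used to train GroupViT jointly with a text encoder. The function name, input shapes, and temperature value are illustrative assumptions.
\begin{verbatim}
# Minimal sketch of a symmetric image-text contrastive loss (illustrative,
# not the released GroupViT code). `image_emb` and `text_emb` are assumed to
# be L2-normalized embeddings of a batch of matched image-text pairs.
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Cosine-similarity logits between every image and every text in the batch.
    logits = image_emb @ text_emb.t() / temperature           # shape (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs lie on the diagonal; contrast each image against all texts
    # and each text against all images, then average the two directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
\end{verbatim}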