Recognizing an image and segmenting it into coherent regions are often treated as separate tasks. Human vision, however, has a general sense of segmentation hierarchy before recognition occurs. We are thus inspired to learn image recognition together with hierarchical image segmentation, based entirely on unlabeled images. Our insight is to learn fine-to-coarse features concurrently at the superpixel, segment, and full-image levels, enforcing the consistency and goodness of feature-induced segmentations while maximizing discrimination among image instances. Our model innovates on vision transformers in three respects. 1) We use adaptive segment tokens instead of fixed-shape patch tokens. 2) We create a token hierarchy by inserting graph pooling between transformer blocks, which naturally produces consistent multi-scale segmentations while increasing segment size and reducing the number of tokens. 3) We produce hierarchical image segmentation for free while training for recognition by maximizing image-wise discrimination. Our work delivers the first concurrent recognition and hierarchical segmentation model without any supervision. Validated on ImageNet and PASCAL VOC, it achieves better recognition and segmentation with higher computational efficiency.
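
As a rough illustration of the token hierarchy described above, the following minimal sketch averages pixel features inside superpixels to form segment tokens, then merges those tokens into coarser ones. It is not the authors' implementation: the function names are hypothetical, the features and centroids are hand-picked toy values, and a plain nearest-centroid assignment stands in for the graph pooling used in the model.

```python
def segment_tokens(feats, labels, num_segments):
    """Average the feature vectors that share a label (one token per segment)."""
    dim = len(feats[0])
    sums = [[0.0] * dim for _ in range(num_segments)]
    counts = [0] * num_segments
    for feat, seg in zip(feats, labels):
        counts[seg] += 1
        for d in range(dim):
            sums[seg][d] += feat[d]
    # max(c, 1) avoids division by zero for empty segments
    return [[s / max(c, 1) for s in row] for row, c in zip(sums, counts)]


def assign_to_centroids(tokens, centroids):
    """Assumption: nearest-centroid assignment replaces learned graph pooling."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [
        min(range(len(centroids)), key=lambda k: sqdist(tok, centroids[k]))
        for tok in tokens
    ]


def coarsen(pixel_labels, assignment):
    """Map each pixel's fine segment id to its coarse segment id."""
    return [assignment[seg] for seg in pixel_labels]


# Toy input: six pixels with 2-D features, grouped into three superpixels.
pixel_feats = [[0.0, 0.0], [0.5, 0.0], [1.0, 1.0], [1.5, 1.0], [5.0, 5.0], [5.5, 5.0]]
fine_labels = [0, 0, 1, 1, 2, 2]

# Level 1: adaptive segment tokens instead of fixed-shape patch tokens.
fine_tokens = segment_tokens(pixel_feats, fine_labels, 3)

# Level 2: fewer, larger segments obtained by merging fine tokens.
centroids = [[0.5, 0.5], [5.0, 5.0]]  # hypothetical, hand-picked
assignment = assign_to_centroids(fine_tokens, centroids)
coarse_tokens = segment_tokens(fine_tokens, assignment, len(centroids))
coarse_labels = coarsen(fine_labels, assignment)

print(fine_tokens)
print(assignment)
print(coarse_labels)
```

Because every coarse label is derived from a fine label, each coarse segment is a union of fine segments, which is the sense in which the multi-scale segmentations stay consistent while the token count shrinks.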