We explore the capability of plain Vision Transformers (ViTs) for semantic segmentation and propose SegViT. Previous ViT-based segmentation networks usually learn a pixel-level representation from the output of the ViT. Instead, we make use of a fundamental component of the ViT, the attention mechanism, to generate masks for semantic segmentation. Specifically, we propose the Attention-to-Mask (ATM) module, in which the similarity maps between a set of learnable class tokens and the spatial feature maps are transferred to segmentation masks. Experiments show that our proposed SegViT using the ATM module outperforms its counterparts using the plain ViT backbone on the ADE20K dataset and achieves new state-of-the-art performance on the COCO-Stuff-10K and PASCAL-Context datasets. Furthermore, to reduce the computational cost of the ViT backbone, we propose query-based down-sampling (QD) and query-based up-sampling (QU) to build a Shrunk structure. With the proposed Shrunk structure, the model can save up to $40\%$ of the computation while maintaining competitive performance.
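The following is a minimal sketch of how an Attention-to-Mask style module could be wired up in PyTorch, purely for illustration: learnable class tokens act as queries against the ViT's spatial features, and the resulting similarity maps are passed through a sigmoid to form per-class masks. The class and parameter names (AttentionToMask, embed_dim, num_classes) are assumptions for this sketch, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class AttentionToMask(nn.Module):
    """Hypothetical ATM-style head: class-token/feature similarity -> masks."""

    def __init__(self, embed_dim: int, num_classes: int):
        super().__init__()
        # One learnable token per semantic class.
        self.class_tokens = nn.Parameter(torch.randn(num_classes, embed_dim))
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.scale = embed_dim ** -0.5

    def forward(self, feats, hw):
        # feats: (B, N, C) patch tokens from a plain ViT backbone, N = H * W.
        B, N, C = feats.shape
        H, W = hw
        q = self.q_proj(self.class_tokens).expand(B, -1, -1)  # (B, K, C)
        k = self.k_proj(feats)                                 # (B, N, C)
        # Similarity maps between class tokens and spatial features.
        sim = torch.einsum("bkc,bnc->bkn", q, k) * self.scale  # (B, K, N)
        # Sigmoid turns each similarity map into a per-class mask.
        masks = sim.sigmoid().reshape(B, -1, H, W)              # (B, K, H, W)
        return masks


# Usage with dummy features: 16x16 patch grid, 768-dim tokens, 150 classes.
x = torch.randn(2, 16 * 16, 768)
atm = AttentionToMask(embed_dim=768, num_classes=150)
print(atm(x, (16, 16)).shape)  # torch.Size([2, 150, 16, 16])
```

In this reading, the masks come directly from the attention similarities rather than from a separately learned per-pixel classifier, which is the key distinction the abstract draws against prior ViT-based segmentation heads.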