Vision transformers (ViTs), which encode an image as a sequence of patches, bring new paradigms to semantic segmentation. We present an efficient framework of representation separation at the local-patch level and the global-region level for semantic segmentation with ViTs. It targets the peculiar over-smoothness of ViTs in semantic segmentation, and therefore differs from the currently popular paradigm of context modeling and from most existing related methods that reinforce the advantage of attention. We first present a decoupled two-pathway network in which an additional pathway enhances and passes down local-patch discrepancy, complementary to the global representations of transformers. We then propose a spatially adaptive separation module to obtain more separated deep representations, and a discriminative cross-attention that yields more discriminative region representations through novel auxiliary supervision. The proposed methods achieve some impressive results: 1) incorporated with large-scale plain ViTs, our methods achieve new state-of-the-art performance on five widely used benchmarks; 2) using masked pre-trained plain ViTs, we achieve 68.9% mIoU on Pascal Context, setting a new record; 3) pyramid ViTs integrated with the decoupled two-pathway network even surpass well-designed high-resolution ViTs on Cityscapes; 4) the improved representations produced by our framework show favorable transferability to images with natural corruptions. The code will be released publicly.
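To make the two-pathway idea concrete, below is a minimal, hypothetical PyTorch sketch of how a lightweight local pathway could carry local-patch discrepancy alongside a ViT backbone and be fused with the global representations before a segmentation head. The abstract does not specify the actual layers, names, or fusion scheme; `LocalPathway`, `TwoPathwaySeg`, and every layer choice here are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch only: the paper's exact architecture is not given in
# the abstract. Shown: a convolutional local pathway fused with ViT tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalPathway(nn.Module):
    """Assumed lightweight branch preserving local-patch detail at a
    higher spatial resolution than the ViT patch grid."""

    def __init__(self, in_ch: int = 3, dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, dim // 2, 3, stride=2, padding=1),
            nn.BatchNorm2d(dim // 2),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // 2, dim, 3, stride=2, padding=1),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)  # (B, dim, H/4, W/4)


class TwoPathwaySeg(nn.Module):
    """Fuses global ViT features with the local pathway, then predicts
    per-pixel class logits with a 1x1 convolution head."""

    def __init__(self, vit: nn.Module, vit_dim: int, dim: int, n_classes: int):
        super().__init__()
        self.vit = vit                     # any backbone returning (B, N, vit_dim)
        self.local = LocalPathway(dim=dim)
        self.proj = nn.Linear(vit_dim, dim)
        self.head = nn.Conv2d(dim, n_classes, 1)

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        local = self.local(img)                          # (B, dim, h, w)
        tokens = self.proj(self.vit(img))                # (B, N, dim)
        b, n, d = tokens.shape
        s = int(n ** 0.5)                                # assume square patch grid
        glob = tokens.transpose(1, 2).reshape(b, d, s, s)
        glob = F.interpolate(glob, size=local.shape[-2:],
                             mode="bilinear", align_corners=False)
        return self.head(glob + local)                   # additive fusion
```

Additive fusion is used here purely for brevity; the paper's decoupled design, separation module, and cross-attention would replace this simple sum.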