Multi-scale representations are crucial for semantic segmentation. The community has witnessed the flourishing of semantic segmentation convolutional neural networks (CNNs) that exploit multi-scale contextual information. Motivated by the strength of the vision transformer (ViT) in image classification, several semantic segmentation ViTs have recently been proposed, most of which attain impressive results but at the cost of computational economy. In this paper, we introduce multi-scale representations into the semantic segmentation ViT via a window attention mechanism, further improving both performance and efficiency. To this end, we propose large window attention, which allows a local window to query a larger area of the context window at only a small computational overhead. By regulating the ratio of the context area to the query area, large window attention captures contextual information at multiple scales. Moreover, the spatial pyramid pooling framework is adopted in collaboration with large window attention, yielding a novel decoder named large window attention spatial pyramid pooling (LawinASPP) for the semantic segmentation ViT. Our resulting ViT, Lawin Transformer, is composed of an efficient hierarchical vision transformer (HVT) as the encoder and LawinASPP as the decoder. Empirical results demonstrate that Lawin Transformer offers improved efficiency compared to existing methods. Lawin Transformer further sets new state-of-the-art performance on the Cityscapes (84.4\% mIoU), ADE20K (56.2\% mIoU) and COCO-Stuff datasets. The code will be released at https://github.com/yan-hao-tian/lawin.
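To make the idea of large window attention concrete, below is a minimal PyTorch sketch, not the authors' implementation: each non-overlapping query window attends to a context window that is $R$ times larger on each side, and the context is average-pooled back to the query-window resolution so the attention cost stays close to standard window attention. The class name `LargeWindowAttention`, the use of `nn.MultiheadAttention`, and the pooling choice are illustrative assumptions; the released code at the URL above is authoritative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LargeWindowAttention(nn.Module):
    """Sketch of large window attention (assumed form, not the official code).

    A local P x P query window attends to an (R*P) x (R*P) context window that
    is average-pooled back to P x P, so the receptive field grows with the
    ratio R while the attention length stays fixed at P*P tokens.
    """

    def __init__(self, dim, window_size=8, ratio=2, num_heads=4):
        super().__init__()
        self.window_size = window_size  # query window side length P
        self.ratio = ratio              # context window side length = R * P
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, C, H, W); assumes H and W are divisible by window_size.
        B, C, H, W = x.shape
        P, R = self.window_size, self.ratio

        # Queries: non-overlapping P x P windows -> (B*nW, P*P, C).
        q = F.unfold(x, kernel_size=P, stride=P)                     # (B, C*P*P, nW)
        q = q.transpose(1, 2).reshape(-1, C, P * P).transpose(1, 2)  # (B*nW, P*P, C)

        # Context: an (R*P) x (R*P) window around each query window (via padding),
        # pooled back to P x P so key/value length matches the query length.
        pad = (R - 1) * P // 2
        ctx = F.unfold(x, kernel_size=R * P, stride=P, padding=pad)  # (B, C*(RP)^2, nW)
        ctx = ctx.transpose(1, 2).reshape(-1, C, R * P, R * P)       # (B*nW, C, RP, RP)
        ctx = F.adaptive_avg_pool2d(ctx, P)                          # (B*nW, C, P, P)
        kv = ctx.flatten(2).transpose(1, 2)                          # (B*nW, P*P, C)

        out, _ = self.attn(q, kv, kv)                                # (B*nW, P*P, C)

        # Fold the windows back into a (B, C, H, W) feature map.
        out = out.transpose(1, 2).reshape(B, -1, C * P * P).transpose(1, 2)
        return F.fold(out, output_size=(H, W), kernel_size=P, stride=P)
```

In a LawinASPP-style decoder, several such branches with different context-to-query ratios (plus a short-path branch and a pooled branch, following the spatial pyramid pooling framework) would run in parallel on the same feature map and their outputs would be concatenated; the exact branch configuration here is an assumption and the released code should be consulted for the actual design.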