Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention

Multi-scale representations are crucial for semantic segmentation. The community has witnessed the flourishing of semantic segmentation convolutional neural networks (CNNs) that exploit multi-scale contextual information. Motivated by the power of the vision transformer (ViT) in image classification, several semantic segmentation ViTs have recently been proposed, most of them attaining impressive results but at the cost of computational economy. In this paper, we succeed in introducing multi-scale representations into semantic segmentation ViT via a window attention mechanism and further improve both performance and efficiency. To this end, we introduce large window attention, which allows a local window to query a larger context window at only a small computational overhead. By regulating the ratio of the context area to the query area, we enable large window attention to capture contextual information at multiple scales. Moreover, the framework of spatial pyramid pooling is adopted to collaborate with large window attention, yielding a novel decoder named large window attention spatial pyramid pooling (LawinASPP) for semantic segmentation ViT. The resulting ViT, Lawin Transformer, is composed of an efficient hierarchical vision transformer (HVT) as the encoder and a LawinASPP as the decoder. The empirical results demonstrate that Lawin Transformer offers improved efficiency compared to existing methods. Lawin Transformer further sets new state-of-the-art performance on the Cityscapes (84.4% mIoU), ADE20K (56.2% mIoU) and COCO-Stuff datasets. The code will be released at https://github.com/yan-hao-tian/lawin.
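To make the idea of a local window querying a larger context window concrete, the following is a minimal pure-Python sketch, not the released implementation. It rests on assumptions that the abstract does not state: the context patch is average-pooled by the ratio `r` so that the number of keys stays equal to the number of queries, the attention is single-head with no learned projections (values equal keys), and borders are zero-padded. All function names (`crop_context`, `avg_pool`, `attend`, `large_window_attention`) are hypothetical.

```python
import math


def crop_context(feat, top, left, p, r):
    """Crop the (r*p) x (r*p) context patch around the p x p query window.

    feat is an H x W x C nested list. Positions outside the map are
    zero-padded (an assumption of this sketch).
    """
    size = r * p
    pad = (size - p) // 2  # context extends `pad` pixels beyond the window
    h, w, c = len(feat), len(feat[0]), len(feat[0][0])
    patch = []
    for i in range(top - pad, top - pad + size):
        row = []
        for j in range(left - pad, left - pad + size):
            inside = 0 <= i < h and 0 <= j < w
            row.append(list(feat[i][j]) if inside else [0.0] * c)
        patch.append(row)
    return patch


def avg_pool(patch, r):
    """Average-pool a square patch by factor r in both spatial dimensions."""
    n = len(patch) // r
    c = len(patch[0][0])
    pooled = []
    for i in range(n):
        row = []
        for j in range(n):
            cell = [0.0] * c
            for di in range(r):
                for dj in range(r):
                    vec = patch[i * r + di][j * r + dj]
                    for k in range(c):
                        cell[k] += vec[k]
            row.append([x / (r * r) for x in cell])
        pooled.append(row)
    return pooled


def attend(queries, keys):
    """Scaled dot-product attention where the values are the keys themselves."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(wt * k[d] for wt, k in zip(weights, keys)) / total
            for d in range(len(q))
        ])
    return outputs


def large_window_attention(feat, top, left, p, r):
    """Let the p x p query window attend to a pooled (r*p) x (r*p) context."""
    queries = [feat[i][j] for i in range(top, top + p)
               for j in range(left, left + p)]
    context = avg_pool(crop_context(feat, top, left, p, r), r)
    keys = [vec for row in context for vec in row]  # always p*p keys
    return attend(queries, keys)
```

Calling `large_window_attention(feat, top, left, p, r)` with several values of `r` (for example 2, 4 and 8) on the same query window gives one output per scale. Under the pooling assumption, each call computes the same number of query-key scores, `p*p` by `p*p`, so enlarging the context adds only the cost of cropping and pooling. How the per-scale outputs are combined in LawinASPP is not described in the abstract and is not sketched here.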



