We present SegNeXt, a simple convolutional network architecture for semantic segmentation. Recent transformer-based models have dominated the field of semantic segmentation due to the efficiency of self-attention in encoding spatial information. In this paper, we show that convolutional attention is a more efficient and effective way to encode contextual information than the self-attention mechanism in transformers. By re-examining the characteristics of successful segmentation models, we identify several key components that lead to their performance improvements. This motivates us to design a novel convolutional attention network that uses cheap convolutional operations. Without bells and whistles, our SegNeXt significantly improves on previous state-of-the-art methods across popular benchmarks, including ADE20K, Cityscapes, COCO-Stuff, Pascal VOC, Pascal Context, and iSAID. Notably, SegNeXt outperforms EfficientNet-L2 w/ NAS-FPN and achieves 90.6% mIoU on the Pascal VOC 2012 test leaderboard using only 1/10 of its parameters. On average, SegNeXt achieves about 2.0% mIoU improvement over state-of-the-art methods on the ADE20K dataset with the same or fewer computations. Code is available at https://github.com/uyzhang/JSeg (Jittor) and https://github.com/Visual-Attention-Network/SegNeXt (PyTorch).
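The abstract's central claim is that attention weights can be produced by cheap convolutions rather than self-attention. A minimal NumPy sketch of that general idea follows: depth-wise convolutions (including strip convolutions, which approximate a large k x k kernel with a 1 x k followed by a k x 1 pass) produce a context map that gates the input feature element-wise. The kernel sizes, the fixed averaging kernels, and the two-branch structure here are illustrative assumptions for exposition, not the exact published module.

```python
import numpy as np

def depthwise_conv2d(x, kernel):
    """Same-padded depth-wise 2D convolution.
    x: (C, H, W) feature map; kernel: (kh, kw), shared across channels for brevity."""
    C, H, W = x.shape
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((0, 0), (ph, ph), (pw, pw)))
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            # Correlate each channel's window with the same kernel.
            out[:, i, j] = np.tensordot(xp[:, i:i + kh, j:j + kw],
                                        kernel, axes=([1, 2], [0, 1]))
    return out

def conv_attention(x):
    """Convolutional attention: cheap convs produce a map that re-weights x."""
    base = depthwise_conv2d(x, np.full((5, 5), 1 / 25.0))        # local context
    # Strip-convolution branches: 1 x k then k x 1 stands in for a k x k kernel.
    b1 = depthwise_conv2d(depthwise_conv2d(base, np.full((1, 7), 1 / 7.0)),
                          np.full((7, 1), 1 / 7.0))              # ~7 x 7 context
    b2 = depthwise_conv2d(depthwise_conv2d(base, np.full((1, 11), 1 / 11.0)),
                          np.full((11, 1), 1 / 11.0))            # ~11 x 11 context
    attn = base + b1 + b2                                        # aggregate branches
    return attn * x                                              # element-wise gating

x = np.random.randn(4, 16, 16)   # (channels, height, width)
y = conv_attention(x)
print(y.shape)                   # same shape as x: features re-weighted, not mixed
```

The cost is linear in the number of pixels (no H*W x H*W attention matrix), which is the efficiency argument the abstract makes against self-attention.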