While a large number of recent works on semantic segmentation focus on designing and incorporating transformer-based encoders, much less attention has been devoted to transformer-based decoders. For a task whose hallmark is pixel-accurate prediction, we argue that the decoder stage is just as crucial as the encoder stage in achieving superior segmentation performance: it disentangles and refines high-level cues and resolves object boundaries with pixel-level precision. In this paper, we propose a novel transformer-based decoder called UperFormer, which is plug-and-play for hierarchical encoders and attains high-quality segmentation results regardless of encoder architecture. UperFormer is equipped with carefully designed multi-head skip attention units and a novel upsampling operation. Multi-head skip attention fuses multi-scale features from the backbone with those in the decoder. The upsampling operation, which incorporates features from the encoder, is more conducive to object localization, bringing a 0.4% to 3.2% improvement over traditional upsampling methods. By combining UperFormer with Swin Transformer (Swin-T), a fully transformer-based symmetric network is formed for semantic segmentation. Extensive experiments show that our proposed approach is highly effective and computationally efficient. On the Cityscapes dataset, we achieve state-of-the-art performance. On the more challenging ADE20K dataset, our best model yields a single-scale mIoU of 50.18 and a multi-scale mIoU of 51.8, on par with the current state-of-the-art model, while drastically cutting the number of FLOPs by 53.5%. Our source code and models are publicly available at: https://github.com/shiwt03/UperFormer
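To make the fusion idea concrete, the sketch below shows single-head cross-attention in which decoder tokens (queries) attend to encoder tokens (keys/values). This is a minimal, illustrative approximation only: the paper's actual multi-head skip attention uses multiple heads, learned projections, and multi-scale feature maps, none of which are shown here.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(queries, keys, values):
    """Fuse encoder features into decoder tokens via scaled dot-product attention.

    queries: list of d-dim vectors (decoder tokens)
    keys, values: lists of d-dim vectors (encoder tokens)
    Returns one fused d-dim vector per query.
    NOTE: hypothetical simplification -- single head, no learned W_q/W_k/W_v.
    """
    d = len(queries[0])
    fused_tokens = []
    for q in queries:
        # Similarity of this decoder token to every encoder token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Weighted sum of encoder values = fused decoder feature.
        fused = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(d)]
        fused_tokens.append(fused)
    return fused_tokens
```

With a single encoder token, the attention weight is 1 and the fused output simply copies that encoder value, which illustrates how skip attention routes encoder information into the decoder path.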