The fully convolutional network (FCN) with an encoder-decoder architecture has been the standard paradigm for semantic segmentation. In this architecture, an encoder captures multi-level feature maps, which a decoder then incorporates into the final prediction. Because context is crucial for precise segmentation, tremendous effort has gone into extracting such information more effectively, including employing dilated/atrous convolutions or inserting attention modules. However, these endeavours are all based on the FCN architecture with ResNet or other backbones, which, in principle, cannot fully exploit contextual information. By contrast, we adopt the Swin Transformer as the backbone to extract context information and design a novel decoder, the densely connected feature aggregation module (DCFAM), to restore the resolution and produce the segmentation map. Experimental results on two remote sensing semantic segmentation datasets demonstrate the effectiveness of the proposed scheme.
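As a brief illustration of why dilated/atrous convolutions are used to gather context, the following is a minimal pure-Python sketch, not the paper's implementation. The function names `dilated_conv1d` and `receptive_field` are hypothetical, and the one-dimensional, stride-1, unpadded setting is a simplifying assumption. It shows that spacing the kernel taps apart widens the receptive field without adding weights.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid (unpadded), stride-1 1D dilated cross-correlation.

    Illustrative helper only; the name and signature are assumptions.
    """
    # Number of input positions covered by one output value.
    span = (len(kernel) - 1) * dilation + 1
    out = []
    for start in range(len(signal) - span + 1):
        # Kernel taps are spaced `dilation` positions apart.
        out.append(sum(w * signal[start + k * dilation]
                       for k, w in enumerate(kernel)))
    return out


def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 dilated convolution layers."""
    rf = 1
    for d in dilations:
        # Each layer extends the field by (kernel_size - 1) * dilation.
        rf += (kernel_size - 1) * d
    return rf
```

With a kernel of size 3, three stacked layers with dilations 1, 2, 4 cover 15 input positions, whereas three undilated layers cover only 7. The field still grows only with depth, which is one way to read the abstract's point that FCN backbones are limited in how much context they can exploit.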