The fully convolutional network (FCN) with an encoder-decoder architecture has become the standard paradigm for semantic segmentation. In this architecture, an encoder captures multi-level feature maps, which a decoder then incorporates into the final prediction. Because context is critical for precise segmentation, considerable effort has gone into extracting it more effectively, for example by employing dilated (atrous) convolutions or inserting attention modules. However, these approaches all build on an FCN architecture with a ResNet backbone, which cannot address the context problem at its root. By contrast, we introduce the Swin Transformer as the backbone to fully extract context information, and we design a novel decoder, the densely connected feature aggregation module (DCFAM), to restore the resolution and generate the segmentation map. Extensive experiments on two datasets demonstrate the effectiveness of the proposed scheme.
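To make the remark about dilated (atrous) convolutions concrete, the following is a minimal pure-Python sketch of how dilation widens the context seen by a convolutional layer without adding weights. It is an illustration only, not code from this work: the function names `dilated_conv1d` and `receptive_field` are hypothetical, the example is one-dimensional for brevity, and it assumes stride-1 layers with no padding.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid (unpadded) 1-D dilated convolution, in cross-correlation form.

    Hypothetical helper for illustration; taps are spaced `dilation` apart.
    """
    span = (len(kernel) - 1) * dilation  # distance from first to last tap
    out = []
    for start in range(len(signal) - span):
        out.append(
            sum(w * signal[start + j * dilation] for j, w in enumerate(kernel))
        )
    return out


def receptive_field(layers):
    """Receptive field of stacked stride-1 layers.

    `layers` is a list of (kernel_size, dilation) pairs; each layer adds
    (kernel_size - 1) * dilation input positions to the receptive field.
    """
    rf = 1
    for kernel_size, dilation in layers:
        rf += (kernel_size - 1) * dilation
    return rf


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5]
    print(dilated_conv1d(x, [1, 1, 1], dilation=1))  # sums of adjacent triples
    print(dilated_conv1d(x, [1, 1, 1], dilation=2))  # taps at positions 0, 2, 4
    print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # plain 3-tap stack
    print(receptive_field([(3, 1), (3, 2), (3, 4)]))  # same depth, dilated
```

With the same number of layers and weights, the dilated stack covers a wider input window than the plain one. That window still grows only with depth, which is the kind of limitation a Transformer backbone with self-attention is meant to avoid.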