It is widely believed that Transformers perform better than convolutional neural networks in semantic segmentation. However, the original Vision Transformer may lack the inductive biases of local neighborhoods and incurs a high time complexity. Recently, Swin Transformer set a new record on various vision tasks by using a hierarchical architecture and shifted windows, while being more efficient. However, since Swin Transformer is specifically designed for image classification, it may achieve suboptimal performance on dense prediction-based segmentation tasks. Further, simply combining Swin Transformer with existing methods would inflate the size and parameter count of the final segmentation model. In this paper, we rethink Swin Transformer for semantic segmentation and design a lightweight yet effective transformer model, called SSformer. In this model, exploiting the inherent hierarchical design of Swin Transformer, we propose a decoder that aggregates information from different layers, thus capturing both local and global attention. Experimental results show that the proposed SSformer yields mIoU performance comparable to state-of-the-art models while maintaining a smaller model size and lower compute.
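To make the multi-layer aggregation idea concrete, below is a minimal PyTorch sketch of a lightweight decoder that fuses hierarchical backbone features. The class name, channel dimensions (Swin-T defaults of 96/192/384/768), and the 1x1-projection-plus-concatenation fusion scheme are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightweightAggregationDecoder(nn.Module):
    """Hypothetical sketch: project each hierarchical feature map to a
    common embedding dimension, upsample all maps to the highest
    resolution, and fuse them before predicting per-pixel classes."""

    def __init__(self, in_channels=(96, 192, 384, 768), embed_dim=256, num_classes=150):
        super().__init__()
        # One 1x1 projection per backbone stage (channel dims assume Swin-T).
        self.projections = nn.ModuleList(
            nn.Conv2d(c, embed_dim, kernel_size=1) for c in in_channels
        )
        # Fuse the concatenated multi-scale features, then classify.
        self.fuse = nn.Conv2d(embed_dim * len(in_channels), embed_dim, kernel_size=1)
        self.classifier = nn.Conv2d(embed_dim, num_classes, kernel_size=1)

    def forward(self, features):
        # `features`: list of 4 maps, from shallow (high-res) to deep (low-res).
        target_size = features[0].shape[2:]
        upsampled = [
            F.interpolate(proj(f), size=target_size, mode="bilinear", align_corners=False)
            for proj, f in zip(self.projections, features)
        ]
        fused = self.fuse(torch.cat(upsampled, dim=1))
        return self.classifier(fused)

# Usage with dummy Swin-T-like stage outputs for a 512x512 input
# (strides 4/8/16/32 give resolutions 128/64/32/16):
feats = [torch.randn(1, c, 128 // 2**i, 128 // 2**i)
         for i, c in enumerate((96, 192, 384, 768))]
logits = LightweightAggregationDecoder()(feats)  # -> (1, 150, 128, 128)
```

Because the shallow stages retain fine local detail while the deep stages carry global context, fusing all four levels is one plausible way such a decoder can obtain both local and global attention with few parameters.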