Transformers have shown impressive performance in various natural language processing and computer vision tasks, owing to their capability of modeling long-range dependencies. Recent progress has demonstrated that combining such Transformers with CNN-based semantic image segmentation models is very promising. However, it has not been well studied how well a pure Transformer-based approach can perform on image segmentation. In this work, we explore a novel framework for semantic image segmentation: the encoder-decoder based Fully Transformer Networks (FTN). Specifically, we first propose a Pyramid Group Transformer (PGT) as the encoder, which progressively learns hierarchical features while reducing the computational complexity of the standard Vision Transformer (ViT). Then, we propose a Feature Pyramid Transformer (FPT) to fuse semantic-level and spatial-level information from multiple levels of the PGT encoder for semantic image segmentation. Surprisingly, this simple baseline achieves better results than previous approaches on multiple challenging semantic segmentation and face parsing benchmarks, including PASCAL Context, ADE20K, COCO-Stuff, and CelebAMask-HQ. The source code will be released at https://github.com/BR-IDL/PaddleViT.
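To make the encoder-decoder design concrete, the following is a minimal PyTorch-style sketch of the overall structure described above: a hierarchical Transformer encoder producing multi-scale features, and a light decoder that fuses all levels for dense prediction. This is an illustrative approximation only, not the released PaddleViT implementation; it omits the grouped attention of PGT and the Transformer-based fusion of FPT, and all class and parameter names (TransformerStage, FTNSketch, dims, etc.) are hypothetical.

```python
# Conceptual sketch of a fully Transformer encoder-decoder segmenter.
# Assumptions: standard Transformer blocks stand in for PGT/FPT modules.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerStage(nn.Module):
    """One encoder stage: strided patch embedding + Transformer blocks."""
    def __init__(self, in_dim, out_dim, depth, num_heads, stride):
        super().__init__()
        # Strided conv acts as patch embedding / patch merging for this stage.
        self.patch_embed = nn.Conv2d(in_dim, out_dim, kernel_size=stride, stride=stride)
        layer = nn.TransformerEncoderLayer(
            d_model=out_dim, nhead=num_heads, dim_feedforward=out_dim * 4,
            batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        x = self.patch_embed(x)                    # (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)      # (B, H*W, C)
        tokens = self.blocks(tokens)
        return tokens.transpose(1, 2).reshape(b, c, h, w)

class FTNSketch(nn.Module):
    """Hierarchical Transformer encoder + multi-level fusion decoder."""
    def __init__(self, num_classes, dims=(64, 128, 256, 512)):
        super().__init__()
        in_dims = (3,) + dims[:-1]
        strides = (4, 2, 2, 2)                     # 1/4, 1/8, 1/16, 1/32 scales
        self.stages = nn.ModuleList(
            TransformerStage(i, o, depth=2, num_heads=max(o // 64, 1), stride=s)
            for i, o, s in zip(in_dims, dims, strides))
        # Decoder: project every level to a common width, upsample, sum, classify.
        self.lateral = nn.ModuleList(nn.Conv2d(d, 256, kernel_size=1) for d in dims)
        self.classifier = nn.Conv2d(256, num_classes, kernel_size=1)

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        size = feats[0].shape[-2:]                 # fuse at 1/4 resolution
        fused = sum(F.interpolate(lat(f), size=size, mode="bilinear",
                                  align_corners=False)
                    for lat, f in zip(self.lateral, feats))
        return self.classifier(fused)              # (B, num_classes, H/4, W/4)

logits = FTNSketch(num_classes=150)(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 150, 56, 56])
```

In this sketch the encoder mirrors the pyramid idea of PGT (attention is applied on progressively smaller token grids, keeping cost below a single-scale ViT), while the decoder mirrors the role of FPT by combining high-level semantic and low-level spatial features before per-pixel classification.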