Transformers have shown impressive performance in various natural language processing and computer vision tasks, owing to their capability of modeling long-range dependencies. Recent progress has demonstrated that combining such Transformers with CNN-based semantic image segmentation models is very promising. However, how well a pure Transformer-based approach can perform on image segmentation has not yet been well studied. In this work, we explore a novel framework for semantic image segmentation: encoder-decoder based Fully Transformer Networks (FTN). Specifically, we first propose a Pyramid Group Transformer (PGT) as the encoder to progressively learn hierarchical features while reducing the computational complexity of the standard Vision Transformer (ViT). We then propose a Feature Pyramid Transformer (FPT) to fuse semantic-level and spatial-level information from multiple levels of the PGT encoder for semantic image segmentation. Surprisingly, this simple baseline achieves new state-of-the-art results on multiple challenging semantic segmentation benchmarks, including PASCAL Context, ADE20K, and COCO-Stuff. The source code will be released upon publication of this work.
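To make the decoder's role concrete, the following is a minimal NumPy sketch of the multi-level fusion idea the abstract describes: hierarchical encoder features (as a PGT-style encoder might emit) are each projected to a common channel width, upsampled to the finest resolution, and summed. The shapes, channel widths, and the linear-projection-plus-sum fusion are illustrative assumptions, not the actual FPT, which performs this fusion with Transformer blocks.

```python
import numpy as np

def upsample(x, factor):
    """Nearest-neighbor upsampling of a (C, H, W) feature map."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse_pyramid(features, out_dim, rng):
    """Fuse multi-level features into one map at the finest scale.

    `features` is a list of (C_i, H_i, W_i) maps ordered fine-to-coarse,
    each level at half the spatial resolution of the previous one. Each
    level is linearly projected to `out_dim` channels (a 1x1-conv-like
    stand-in for the FPT's learned fusion), upsampled, and summed.
    """
    _, H, W = features[0].shape
    fused = np.zeros((out_dim, H, W))
    for i, f in enumerate(features):
        C = f.shape[0]
        # Random projection only for illustration; in practice this is learned.
        proj = rng.standard_normal((out_dim, C)) / np.sqrt(C)
        g = np.einsum('oc,chw->ohw', proj, f)
        fused += upsample(g, 2 ** i)
    return fused

rng = np.random.default_rng(0)
# Four pyramid levels with doubling channel widths and halving spatial size
# (illustrative numbers, e.g. strides 4/8/16/32 for a 128x128 input).
feats = [rng.standard_normal((64 * 2 ** i, 32 // 2 ** i, 32 // 2 ** i))
         for i in range(4)]
out = fuse_pyramid(feats, out_dim=256, rng=rng)
print(out.shape)  # (256, 32, 32)
```

A per-pixel classifier applied to `out` would then yield the segmentation logits at 1/4 of the input resolution.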