Semantic segmentation usually benefits from global context, fine localisation information, multi-scale features, and so on. To advance Transformer-based segmenters along these lines, we present a simple yet powerful semantic segmentation architecture, termed IncepFormer. IncepFormer makes two key contributions. First, it introduces a novel pyramid-structured Transformer encoder that harvests global context and fine localisation features simultaneously. These features are concatenated and fed into a convolution layer for final per-pixel prediction. Second, IncepFormer integrates an Inception-like architecture with depth-wise convolutions, together with a light-weight feed-forward module, into each self-attention layer, efficiently obtaining rich local multi-scale object features. Extensive experiments on five benchmarks show that IncepFormer is superior to state-of-the-art methods in both accuracy and speed: 1) IncepFormer-S achieves 47.7% mIoU on ADE20K, outperforming the existing best method by 1% while using only half the parameters and fewer FLOPs; 2) IncepFormer-B achieves 82.0% mIoU on the Cityscapes dataset with 39.6M parameters. Code is available at github.com/shendu0321/IncepFormer.
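To make the "Inception-like architecture with depth-wise convolutions" concrete, here is a minimal NumPy sketch of the idea: several depth-wise convolutions with different kernel sizes run in parallel on the same feature map, their outputs are concatenated along the channel axis, and a 1x1 (pointwise) projection fuses them back to the original channel count. This is only an illustrative sketch with random weights, not the paper's actual module; the kernel sizes and all function names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def depthwise_conv(x, weights):
    """Same-padded depth-wise convolution: one k x k kernel per channel.

    x: feature map of shape (C, H, W); weights: (C, k, k), k odd.
    """
    C, H, W = x.shape
    k = weights.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.empty_like(x)
    for c in range(C):                 # each channel gets its own kernel
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(xp[c, i:i + k, j:j + k] * weights[c])
    return out

def inception_dw_block(x, kernel_sizes=(3, 5, 7)):
    """Inception-style multi-branch depth-wise convs (multi-scale local
    context), concatenated on channels, fused by a 1x1 projection."""
    C, H, W = x.shape
    branches = []
    for k in kernel_sizes:
        w = rng.standard_normal((C, k, k)) * 0.1   # random stand-in weights
        branches.append(depthwise_conv(x, w))
    cat = np.concatenate(branches, axis=0)          # (len(ks) * C, H, W)
    w_pw = rng.standard_normal((C, cat.shape[0])) * 0.1
    fused = np.einsum('oc,chw->ohw', w_pw, cat)     # 1x1 conv back to C channels
    return fused

x = rng.standard_normal((4, 8, 8))
y = inception_dw_block(x)
print(y.shape)  # (4, 8, 8): same spatial size and channel count as the input
```

Each branch sees a different receptive field (3x3, 5x5, 7x7 here), which is how the block gathers local features at multiple scales before the pointwise fusion; depth-wise kernels keep the cost low because each kernel touches only one channel.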