Transformers have proved highly effective for visual recognition tasks. In particular, vision transformers construct compressed global representations through self-attention and learnable class tokens. Multi-resolution transformers have shown recent success in semantic segmentation but can only capture local interactions in high-resolution feature maps. This paper extends the notion of global tokens to build GLobal Attention Multi-resolution (GLAM) transformers. GLAM is a generic module that can be integrated into most existing transformer backbones. GLAM includes learnable global tokens which, unlike previous methods, can model interactions between all image regions, and it extracts powerful representations during training. Extensive experiments show that GLAM-Swin and GLAM-Swin-UNet perform substantially better than their vanilla counterparts on ADE20K and Cityscapes. Moreover, GLAM can be used to segment large 3D medical images, and GLAM-nnFormer achieves new state-of-the-art performance on the BCV dataset.
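To make the global-token idea concrete, the sketch below shows one minimal way learnable global tokens can be prepended to patch tokens so that they attend to, and are attended by, every image region. This is an illustrative PyTorch sketch, not the authors' implementation; the class name `GlobalTokenAttention` and parameters such as `num_global` are hypothetical.

```python
# Hypothetical sketch of the global-token mechanism described above.
# Names (GlobalTokenAttention, num_global) are illustrative, not from
# the GLAM authors' code.
import torch
import torch.nn as nn

class GlobalTokenAttention(nn.Module):
    """Self-attention over patch tokens plus learnable global tokens.

    The global tokens attend to (and are attended by) every patch token,
    so they can aggregate information from all image regions even when
    the backbone otherwise restricts attention to local windows.
    """
    def __init__(self, dim: int, num_heads: int = 8, num_global: int = 4):
        super().__init__()
        # Learnable global tokens, shared across the batch.
        self.global_tokens = nn.Parameter(torch.zeros(1, num_global, dim))
        nn.init.trunc_normal_(self.global_tokens, std=0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.num_global = num_global

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, dim) flattened feature map.
        b = x.size(0)
        g = self.global_tokens.expand(b, -1, -1)
        tokens = torch.cat([g, x], dim=1)           # prepend global tokens
        out, _ = self.attn(tokens, tokens, tokens)  # full self-attention
        return out[:, self.num_global:]             # updated patch tokens


# Usage: refine a flattened 32x32 feature map of dimension 96.
feats = torch.randn(2, 32 * 32, 96)
module = GlobalTokenAttention(dim=96)
print(module(feats).shape)  # torch.Size([2, 1024, 96])
```

In a multi-resolution backbone such as Swin, a module of this kind would be applied at each stage so that global context is injected even into the high-resolution feature maps where attention is otherwise local.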