Recently, a variety of vision transformers have been developed, owing to their capability of modeling long-range dependencies. In current transformer-based backbones for medical image segmentation, convolutional layers are either replaced with pure transformers, or transformers are added to the deepest encoder to learn the global context. However, there remain two main challenges from a scale-wise perspective: (1) intra-scale problem: existing methods lack the ability to extract local-global cues at each scale, which may impair the signal propagation of small objects; (2) inter-scale problem: existing methods fail to explore distinctive information across multiple scales, which may hinder representation learning for objects of widely variable sizes, shapes, and locations. To address these limitations, we propose a novel backbone, named ScaleFormer, with two appealing designs: (1) a scale-wise intra-scale transformer is designed to couple CNN-based local features with transformer-based global cues at each scale, where row-wise and column-wise global dependencies can be extracted by a lightweight Dual-Axis MSA; (2) a simple and effective spatial-aware inter-scale transformer is designed to interact among consensual regions across multiple scales, which can highlight cross-scale dependencies and resolve complex scale variations. Experimental results on different benchmarks demonstrate that our ScaleFormer outperforms current state-of-the-art methods. The code is publicly available at: https://github.com/ZJUGiveLab/ScaleFormer.
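To make the row-wise/column-wise attention idea concrete, below is a minimal PyTorch sketch of a dual-axis multi-head self-attention block: attention is applied along each row of a 2D feature map, then along each column. The class name `DualAxisMSA`, the head count, and the use of `nn.MultiheadAttention` are illustrative assumptions, not the authors' exact implementation. Factorizing attention along the two axes reduces the cost from O((HW)^2) for full self-attention to O(HW(H+W)), which is where the "lightweight" property comes from.

```python
# Illustrative sketch of row-wise then column-wise MHSA over a 2D feature
# map (an axial-attention pattern); NOT the paper's exact Dual-Axis MSA.
import torch
import torch.nn as nn


class DualAxisMSA(nn.Module):
    """Applies MHSA along rows, then along columns, of a (B, C, H, W) map."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape

        # Row-wise attention: each of the B*H rows is a sequence of W tokens.
        rows = x.permute(0, 2, 3, 1).reshape(b * h, w, c)
        rows, _ = self.row_attn(rows, rows, rows)
        x = rows.reshape(b, h, w, c).permute(0, 3, 1, 2)

        # Column-wise attention: each of the B*W columns is a sequence of H tokens.
        cols = x.permute(0, 3, 2, 1).reshape(b * w, h, c)
        cols, _ = self.col_attn(cols, cols, cols)
        return cols.reshape(b, w, h, c).permute(0, 3, 2, 1)


# Example: global context for a 64-channel CNN feature map at one encoder scale.
feat = torch.randn(2, 64, 32, 32)
out = DualAxisMSA(dim=64)(feat)
print(out.shape)  # torch.Size([2, 64, 32, 32])
```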
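Similarly, a hedged sketch of one plausible form of spatial-aware inter-scale interaction: feature maps from several encoder scales are projected to a common channel width, pooled onto a shared spatial grid, and self-attention then mixes the tokens that share a spatial location across scales. The grid size, 1x1 projections, and average pooling are assumptions for illustration; the paper's selection of "consensual regions" is not reproduced here.

```python
# Illustrative sketch of attention across scales at aligned spatial
# positions; NOT the paper's exact spatial-aware inter-scale transformer.
import torch
import torch.nn as nn
import torch.nn.functional as F


class InterScaleAttention(nn.Module):
    def __init__(self, dims, embed_dim: int = 64, num_heads: int = 4, grid: int = 8):
        super().__init__()
        self.grid = grid
        self.proj = nn.ModuleList(nn.Conv2d(d, embed_dim, 1) for d in dims)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, feats):
        # feats[i]: (B, dims[i], H_i, W_i) from encoder scale i.
        tokens = []
        for f, proj in zip(feats, self.proj):
            f = proj(f)                              # (B, E, H_i, W_i)
            f = F.adaptive_avg_pool2d(f, self.grid)  # pool to shared (G, G) grid
            tokens.append(f.flatten(2).transpose(1, 2))  # (B, G*G, E)
        # Stack per location: (B, G*G, S, E), then attend over the S scales.
        x = torch.stack(tokens, dim=2)
        b, n, s, e = x.shape
        x = x.reshape(b * n, s, e)
        x, _ = self.attn(x, x, x)
        return x.reshape(b, n, s, e)


# Example: three encoder scales of a typical pyramid.
feats = [torch.randn(2, 64, 32, 32),
         torch.randn(2, 128, 16, 16),
         torch.randn(2, 256, 8, 8)]
out = InterScaleAttention(dims=[64, 128, 256])(feats)
print(out.shape)  # torch.Size([2, 64, 3, 64])
```

Restricting attention to tokens at the same pooled location keeps the interaction spatially aligned while still letting each position aggregate evidence from coarse and fine scales, which is one way to read the "cross-scale dependency" the abstract describes.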