U-Nets have achieved tremendous success in medical image segmentation. Nevertheless, they may suffer from limited global (long-range) contextual interaction and poor edge-detail preservation. In contrast, the Transformer has an excellent ability to capture long-range dependencies by leveraging the self-attention mechanism in the encoder. Although the Transformer was designed to model long-range dependencies on the extracted feature maps, it still suffers from extreme computational and spatial complexity when processing high-resolution 3D feature maps. This motivates us to design an efficient Transformer-based U-Net model and to study the feasibility of Transformer-based network architectures for medical image segmentation tasks. To this end, we propose to self-distill a Transformer-based U-Net for medical image segmentation, which simultaneously learns global semantic information and local spatially detailed features. Meanwhile, a local multi-scale fusion block is first proposed to refine fine-grained details from the skip connections in the encoder through self-distillation by the main CNN stem; it is computed only during training and removed at inference with minimal overhead. Extensive experiments on the BraTS 2019 and CHAOS datasets show that our MISSU achieves the best performance over previous state-of-the-art methods. Code and models are available at \url{https://github.com/wangn123/MISSU.git}
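The training-only auxiliary branch described above can be illustrated with a minimal sketch (all names and shapes are assumptions for illustration, not the authors' implementation; NumPy stands in for a real deep-learning framework): a main stem produces features, an auxiliary multi-scale fusion branch produces fused features, and an MSE self-distillation loss aligns the two during training; at inference the auxiliary branch is simply skipped, so it adds no cost to deployment.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, w):
    # Stand-in for a conv block: linear map plus nonlinearity (illustration only).
    return np.tanh(x @ w)

class TinySelfDistillNet:
    """Hypothetical sketch of a main stem with a training-only
    multi-scale fusion branch supervised by self-distillation."""

    def __init__(self, dim=8):
        self.w_main = rng.normal(size=(dim, dim))        # main CNN stem
        self.w_aux_fine = rng.normal(size=(dim, dim))    # fine-scale path
        self.w_aux_coarse = rng.normal(size=(dim, dim))  # coarse-scale path

    def forward(self, x, training=False):
        main = layer(x, self.w_main)
        if not training:
            # Auxiliary fusion branch is removed at inference.
            return main, None
        # "Local multi-scale fusion": average two scales of the skip feature.
        fused = 0.5 * (layer(x, self.w_aux_fine) + layer(x, self.w_aux_coarse))
        # Self-distillation: pull fused details toward the main-stem features.
        distill_loss = float(np.mean((main - fused) ** 2))
        return main, distill_loss

net = TinySelfDistillNet()
x = rng.normal(size=(4, 8))
_, loss = net.forward(x, training=True)       # distillation loss computed
out, none_loss = net.forward(x, training=False)  # branch skipped at inference
```

Because the distillation term only shapes the main stem's weights during training, inference-time behavior is identical to a plain encoder forward pass.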