Recently, the Vision Transformer (ViT) has been widely used across computer vision because it applies the self-attention mechanism in the spatial domain to model global knowledge. In medical image segmentation (MIS) in particular, many works combine ViT and CNN, and some even use pure ViT-based models. However, these works improve models in the spatial domain while ignoring the importance of frequency-domain information. Therefore, we propose Multi-axis External Weights UNet (MEW-UNet) for MIS, built on the U-shape architecture by replacing self-attention in ViT with our Multi-axis External Weights block. Specifically, our block performs a Fourier transform along three axes of the input feature and applies external weights in the frequency domain, which are generated by our Weights Generator. An inverse Fourier transform then maps the features back to the spatial domain. We evaluate our model on four datasets and achieve state-of-the-art performance. In particular, on the Synapse dataset, our method outperforms MT-UNet by 10.15 mm in terms of HD95. Code is available at https://github.com/JCruan519/MEW-UNet.
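The frequency-domain weighting described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function name `mew_block`, the per-axis weight layout, and the simple averaging of the three axes are assumptions for exposition; in the actual model the weights come from a learned Weights Generator network.

```python
import numpy as np

def mew_block(x, weights):
    """Illustrative sketch of a Multi-axis External Weights operation.

    x       : real array of shape (H, W, C)
    weights : one weight (scalar or broadcastable array) per axis

    For each of the three axes: transform to the frequency domain,
    multiply by an external weight, transform back, then average.
    """
    out = np.zeros_like(x)
    for axis in range(3):
        freq = np.fft.fft(x, axis=axis)        # FFT along one axis
        freq = freq * weights[axis]            # apply external weight in frequency domain
        out = out + np.fft.ifft(freq, axis=axis).real  # back to spatial domain
    return out / 3.0
```

With all-ones weights the block reduces to the identity, since each FFT/inverse-FFT pair cancels; nontrivial weights act as learned per-frequency filters along each axis.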