Transformers have demonstrated remarkable performance in natural language processing and computer vision. However, existing vision Transformers struggle to learn from limited medical data and fail to generalize across diverse medical imaging tasks. To tackle these challenges, we present MedFormer, a data-scalable Transformer designed for generalizable 3D medical image segmentation. Our approach incorporates three key elements: a desirable inductive bias, hierarchical modeling with linear-complexity attention, and multi-scale feature fusion that integrates spatial and semantic information globally. MedFormer can learn on tiny- to large-scale datasets without pre-training. Comprehensive experiments demonstrate MedFormer's potential as a versatile segmentation backbone, outperforming CNNs and vision Transformers on seven public datasets covering multiple modalities (e.g., CT and MRI) and various medical targets (e.g., healthy organs, diseased tissues, and tumors). We provide public access to our models and evaluation pipeline, offering solid baselines and unbiased comparisons to advance a wide range of downstream clinical applications.
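To make the "linear-complexity attention" ingredient more concrete, below is a minimal PyTorch sketch of a kernelized linear attention block applied to a 3D feature map. This is not MedFormer's actual implementation; the module name `LinearAttention3D`, the ELU-based feature map, and all hyperparameters are illustrative assumptions, shown only to convey how attention cost can scale linearly with the number of voxel tokens.

```python
# Minimal sketch of a linear-complexity attention block for 3D feature maps.
# NOT MedFormer's actual code; names and design choices are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearAttention3D(nn.Module):
    """Attention computed as phi(Q) (phi(K)^T V), so the cost is O(N * d^2)
    rather than O(N^2 * d) for N = D*H*W voxel tokens of dimension d."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (B, C, D, H, W)
        B, C, D, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)                  # (B, N, C)
        qkv = self.qkv(tokens).reshape(B, -1, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                   # each: (B, heads, N, head_dim)

        # Positive feature map replaces softmax so K^T V can be computed first.
        q, k = F.elu(q) + 1, F.elu(k) + 1
        kv = k.transpose(-2, -1) @ v                           # (B, heads, head_dim, head_dim)
        z = 1.0 / (q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + 1e-6)  # normalizer
        out = (q @ kv) * z                                     # (B, heads, N, head_dim)

        out = out.transpose(1, 2).reshape(B, -1, C)
        return self.proj(out).transpose(1, 2).reshape(B, C, D, H, W)


if __name__ == "__main__":
    x = torch.randn(2, 32, 16, 32, 32)          # a small 3D feature map
    print(LinearAttention3D(dim=32)(x).shape)   # torch.Size([2, 32, 16, 32, 32])
```

In this style of attention, the key-value summary `K^T V` is a small `d x d` matrix computed once, so memory and compute grow linearly with the voxel count, which is what makes full-volume 3D attention tractable; the actual mechanism used in MedFormer may differ.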