Transformers have shown great success in medical image segmentation. However, transformers may exhibit limited generalization ability due to their underlying single-scale self-attention (SA) mechanism. In this paper, we address this issue by introducing a Multi-scale hiERarchical vIsion Transformer (MERIT) backbone network, which improves the generalizability of the model by computing SA at multiple scales. We also incorporate an attention-based decoder, namely Cascaded Attention Decoding (CASCADE), to further refine the multi-stage features generated by MERIT. Finally, we introduce an effective multi-stage feature mixing loss aggregation (MUTATION) method for better model training via implicit ensembling. Our experiments on two widely used medical image segmentation benchmarks (i.e., Synapse Multi-organ and ACDC) demonstrate the superior performance of MERIT over state-of-the-art methods. Our MERIT architecture and MUTATION loss aggregation can also be applied to other downstream medical image and semantic segmentation tasks.
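To make the idea of loss aggregation via implicit ensembling concrete, the sketch below illustrates one plausible reading of a MUTATION-style scheme: the predictions from multiple decoder stages are mixed (here by summation) over all non-empty stage combinations, each mixture is scored against the ground truth, and the per-combination losses are summed. This is a minimal illustrative sketch, not the authors' implementation; the function name `mutation_loss`, the use of summation as the mixing operator, and the `criterion` argument are assumptions for exposition.

```python
# Minimal sketch of a MUTATION-style multi-stage loss aggregation (assumed, not the paper's code).
from itertools import combinations

import torch


def mutation_loss(preds, target, criterion):
    """Aggregate loss over all non-empty combinations of multi-stage predictions.

    preds:     list of per-stage segmentation logits, each of shape (B, C, H, W)
    target:    ground-truth labels compatible with `criterion`
    criterion: any segmentation loss, e.g. cross-entropy or Dice + cross-entropy
    """
    total = 0.0
    for r in range(1, len(preds) + 1):
        for combo in combinations(preds, r):
            # Mix the selected stage predictions by summation (assumed mixing operator).
            mixed = torch.stack(combo, dim=0).sum(dim=0)
            total = total + criterion(mixed, target)
    return total


# Hypothetical usage with four decoder stages:
# preds = [p1, p2, p3, p4]  # each (B, C, H, W) logits from one decoder stage
# loss = mutation_loss(preds, target, torch.nn.functional.cross_entropy)
# loss.backward()
```

Because every subset of stages contributes a scored mixture, the gradient signal resembles training an ensemble of stage combinations without instantiating separate models, which is one way to interpret the "implicit ensembling" described above.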