Owing to their success in the data-rich domain of natural images, Transformers have recently become popular in medical image segmentation. However, the pairing of Transformers with convolutional blocks in varying architectural permutations leaves their relative effectiveness open to interpretation. We introduce Transformer Ablations, which replace Transformer blocks with plain linear operators, to quantify this effectiveness. With experiments on 8 models across 2 medical image segmentation tasks, we find that: 1) Transformer-learnt representations are often replaceable, 2) Transformer capacity alone cannot prevent representational replaceability and must work in tandem with effective design, 3) the mere existence of explicit feature hierarchies in Transformer blocks is more beneficial than the accompanying self-attention modules, and 4) major spatial downsampling before Transformer modules should be used with caution.
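Below is a minimal sketch of what a Transformer Ablation could look like in PyTorch (the framework is an assumption, as are all module and parameter names): a Transformer block is swapped for a plain token-wise linear operator with the same input/output interface, after which the ablated model is retrained and compared against the original. This is an illustration of the general idea, not the authors' exact implementation.

```python
# Illustrative sketch of a "Transformer Ablation": replace a Transformer
# block with a plain linear operator that keeps the same tensor interface.
# Class and argument names here are hypothetical, not taken from the paper.
import torch
import torch.nn as nn


class LinearAblationBlock(nn.Module):
    """Drop-in replacement for a Transformer block: a single token-wise
    linear projection, with no self-attention, MLP, or normalization."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim) -- same shape contract as the original block
        return self.proj(x)


def ablate_transformer_blocks(model: nn.Module, block_type: type, dim: int) -> nn.Module:
    """Recursively replace every module of `block_type` in `model`
    with a LinearAblationBlock of the given embedding dimension."""
    for name, child in model.named_children():
        if isinstance(child, block_type):
            setattr(model, name, LinearAblationBlock(dim))
        else:
            ablate_transformer_blocks(child, block_type, dim)
    return model
```

In use, one would ablate the Transformer blocks of a segmentation model this way, retrain under the same schedule, and attribute any drop in segmentation performance to the representations the Transformer blocks had learnt.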