CoTr: Efficiently Bridging CNN and Transformer for 3D Medical Image Segmentation

Convolutional neural networks (CNNs) have been the de facto standard for 3D medical image segmentation. The convolutional operations used in these networks, however, are inevitably limited in modeling long-range dependencies because of their inductive biases of locality and weight sharing. Although the Transformer was designed to address this issue, it suffers from extreme computational and spatial complexity when processing high-resolution 3D feature maps. In this paper, we propose a novel framework that efficiently bridges a Convolutional neural network and a Transformer (CoTr) for accurate 3D medical image segmentation. Under this framework, the CNN is constructed to extract feature representations, and an efficient deformable Transformer (DeTrans) is built to model the long-range dependencies on the extracted feature maps. Unlike the vanilla Transformer, which treats all image positions equally, our DeTrans attends only to a small set of key positions by introducing a deformable self-attention mechanism. The computational and spatial complexity of DeTrans is thus greatly reduced, making it possible to process the multi-scale, high-resolution feature maps that are usually of paramount importance for image segmentation. We conduct an extensive evaluation on the Multi-Atlas Labeling Beyond the Cranial Vault (BCV) dataset, which covers 11 major human organs. The results indicate that CoTr yields a substantial performance improvement over other CNN-based, Transformer-based, and hybrid methods on the 3D multi-organ segmentation task. Code is available at https://github.com/YtongXie/CoTr
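To make the complexity argument concrete, the following is a minimal sketch, not the authors' implementation. It assumes a simplified single-head, single-scale variant over a flattened 1D sequence, whereas DeTrans operates on multi-scale 3D feature maps. The names `attention_pairs`, `deformable_attention_1d`, `w_offset`, `w_weight`, and `w_value` are hypothetical and chosen only for illustration. Each query predicts K sampling offsets and K attention weights from its own feature, so the cost grows as N·K rather than N².

```python
import numpy as np


def attention_pairs(shape, num_keys=None):
    """Count query-key interactions for a feature map of the given shape.

    num_keys=None models vanilla self-attention (every position attends to
    every position); an integer models attention restricted to that many
    sampled key positions per query.
    """
    n = int(np.prod(shape))
    return n * n if num_keys is None else n * num_keys


def deformable_attention_1d(x, w_offset, w_weight, w_value):
    """Simplified single-head deformable attention over a 1D sequence.

    x:        (N, C) flattened feature map
    w_offset: (C, K) projects each query to K sampling offsets (assumed form)
    w_weight: (C, K) projects each query to K attention logits (assumed form)
    w_value:  (C, C) value projection
    """
    n, _ = x.shape
    values = x @ w_value                              # (N, C)
    offsets = x @ w_offset                            # (N, K)
    logits = x @ w_weight                             # (N, K)

    # Softmax over the K sampled keys of each query.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)

    # Sampling location = the query's own position plus a learned offset,
    # clipped so that it stays inside the sequence.
    ref = np.arange(n, dtype=float)[:, None]          # (N, 1)
    loc = np.clip(ref + offsets, 0.0, n - 1.0)        # (N, K)

    # Linear interpolation between the two nearest integer positions.
    lo = np.floor(loc).astype(int)
    hi = np.minimum(lo + 1, n - 1)
    frac = (loc - lo)[..., None]                      # (N, K, 1)
    sampled = (1.0 - frac) * values[lo] + frac * values[hi]  # (N, K, C)

    # Weighted sum over the K sampled keys.
    return (weights[..., None] * sampled).sum(axis=1)  # (N, C)
```

As a purely illustrative calculation, a 16×16×16 feature map has N = 4096 positions: `attention_pairs((16, 16, 16))` gives 16,777,216 query-key interactions for vanilla attention, while `attention_pairs((16, 16, 16), num_keys=4)` gives 16,384. The value K = 4 is an arbitrary choice for this example, not a setting taken from the paper.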
