Recent 3D medical ViTs (e.g., SwinUNETR) achieve state-of-the-art performance on several 3D volumetric data benchmarks, including 3D medical image segmentation. Hierarchical transformers (e.g., Swin Transformers) reintroduced several ConvNet priors and further improved the practical viability of volumetric segmentation on 3D medical datasets. The effectiveness of hybrid approaches is largely credited to the large receptive field for non-local self-attention and the large number of model parameters. In this work, we propose a lightweight volumetric ConvNet, termed 3D UX-Net, which adapts the hierarchical transformer using ConvNet modules for robust volumetric segmentation. Specifically, we revisit volumetric depth-wise convolutions with large kernel sizes (e.g., starting from $7\times7\times7$) to enable larger global receptive fields, inspired by Swin Transformer. We further substitute the multi-layer perceptron (MLP) in Swin Transformer blocks with pointwise depth convolutions and enhance model performance with fewer normalization and activation layers, thus reducing the number of model parameters. 3D UX-Net competes favorably with current SOTA transformers (e.g., SwinUNETR) on three challenging public datasets for volumetric brain and abdominal imaging: 1) MICCAI Challenge 2021 FLARE, 2) MICCAI Challenge 2021 FeTA, and 3) MICCAI Challenge 2022 AMOS. 3D UX-Net consistently outperforms SwinUNETR, improving Dice from 0.929 to 0.938 (FLARE2021) and from 0.867 to 0.874 (FeTA2021). We further evaluate the transfer learning capability of 3D UX-Net on AMOS2022 and demonstrate another improvement of $2.27\%$ Dice (from 0.880 to 0.900). The source code for our proposed model is available at https://github.com/MASILab/3DUX-Net.
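To make the parameter argument concrete, the following is a minimal sketch (not taken from the released code) that counts the learnable parameters of a $7\times7\times7$ depth-wise 3D convolution against a dense 3D convolution of the same kernel size, and checks that the reported AMOS2022 gain is consistent with a relative Dice improvement. The channel width of 48 and the helper names `conv3d_params` and `relative_gain` are assumptions chosen for illustration only.

```python
def conv3d_params(c_in, c_out, kernel, groups=1, bias=True):
    """Count learnable parameters of a 3D convolution with a cubic kernel.

    Each output channel sees c_in / groups input channels, so a depth-wise
    convolution (groups == c_in == c_out) has one kernel per channel.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    weights = c_out * (c_in // groups) * kernel ** 3
    return weights + (c_out if bias else 0)


def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# Hypothetical channel width, chosen only for illustration.
channels = 48

dense = conv3d_params(channels, channels, kernel=7)
depthwise = conv3d_params(channels, channels, kernel=7, groups=channels)

print(f"dense 7x7x7 conv:      {dense} parameters")
print(f"depth-wise 7x7x7 conv: {depthwise} parameters")

# Dice scores quoted in the abstract for AMOS2022 transfer learning.
print(f"AMOS2022 relative Dice gain: {relative_gain(0.880, 0.900):.2f}%")
```

Under these assumptions the depth-wise layer needs roughly 1/48 of the weights of the dense layer, which illustrates why a large kernel can stay affordable in a lightweight network. The quoted $2.27\%$ appears to be a relative gain, since the absolute difference between 0.880 and 0.900 is 0.020 Dice.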