The Transformer architecture has proven successful in a number of natural language processing tasks. However, its applications to medical vision remain largely unexplored. In this study, we present UTNet, a simple yet powerful hybrid Transformer architecture that integrates self-attention into a convolutional neural network to enhance medical image segmentation. UTNet applies self-attention modules in both the encoder and decoder to capture long-range dependencies at different scales with minimal overhead. To this end, we propose an efficient self-attention mechanism, along with relative position encoding, that significantly reduces the complexity of the self-attention operation from $O(n^2)$ to approximately $O(n)$. A new self-attention decoder is also proposed to recover fine-grained details from the skip connections in the encoder. Our approach addresses the dilemma that Transformers require huge amounts of data to learn vision inductive biases. Our hybrid layer design allows Transformer modules to be initialized within convolutional networks without the need for pre-training. We have evaluated UTNet on a multi-label, multi-vendor cardiac magnetic resonance imaging cohort. UTNet demonstrates superior segmentation performance and robustness compared with state-of-the-art approaches, holding the promise to generalize well to other medical image segmentation tasks.
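The complexity reduction described above can be illustrated with a minimal NumPy sketch. This is not the paper's exact mechanism (UTNet sub-samples keys and values via learned low-resolution projections); here we use simple strided sub-sampling as a stand-in, and the function and parameter names (`efficient_self_attention`, `pool`) are hypothetical. Shrinking K and V from $n$ tokens to $n/\text{pool}$ tokens shrinks the attention matrix from $n \times n$ to $n \times n/\text{pool}$, which is the source of the near-linear cost:

```python
import numpy as np

def efficient_self_attention(x, q_proj, k_proj, v_proj, pool=4):
    """Self-attention with spatially sub-sampled keys/values (sketch).

    x: (n, d) flattened feature map with n tokens.
    Sub-sampling K and V to n/pool tokens reduces the attention
    matrix from O(n^2) entries to O(n * n/pool).
    """
    q = x @ q_proj                 # queries, shape (n, d)
    k = (x @ k_proj)[::pool]       # keys, strided down to (n/pool, d)
    v = (x @ v_proj)[::pool]       # values, strided down to (n/pool, d)
    scores = q @ k.T / np.sqrt(x.shape[1])          # (n, n/pool) logits
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)               # row-wise softmax
    return w @ v                   # attended output, shape (n, d)

# Usage with random features: 64 tokens, 8 channels.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 8))
wq, wk, wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = efficient_self_attention(x, wq, wk, wv)
print(out.shape)  # (64, 8): same token count, but only 64x16 attention logits
```

With `pool=4`, each query attends over 16 rather than 64 positions; in UTNet the analogous sub-sampling is applied at every resolution level of the encoder and decoder.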