This paper presents a deep learning framework for medical video segmentation. Convolutional neural network (CNN) and transformer-based methods have achieved notable milestones in medical image segmentation tasks thanks to their strong semantic feature encoding and global information comprehension abilities. However, most existing approaches ignore a salient aspect of medical video data: the temporal dimension. Our proposed framework explicitly extracts features from neighbouring frames across the temporal dimension and fuses them with a temporal feature blender; the blended high-level spatio-temporal feature is then tokenised and encoded into a strong global representation by a Swin Transformer. The final segmentation results are produced via a UNet-like encoder-decoder architecture. Our model outperforms other approaches by a significant margin and improves the segmentation benchmarks on the VFSS2022 dataset, achieving Dice coefficients of 0.8986 and 0.8186 on the two datasets tested. Our studies also show the efficacy of the temporal feature blending scheme and the cross-dataset transferability of the learned capabilities. Code and models are publicly available at https://github.com/SimonZeng7108/Video-SwinUNet.
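To make the described data flow concrete, the following is a minimal PyTorch sketch of the pipeline the abstract outlines: per-frame CNN features, a temporal feature blender fusing neighbouring frames, tokenisation, global transformer encoding, and a UNet-like decoder. This is not the authors' implementation: a plain transformer encoder stands in for the Swin Transformer, skip connections are omitted for brevity, and every module name and hyperparameter here (`VideoSegSketch`, `feat_ch`, `embed_dim`, the 3-frame window) is an illustrative assumption.

```python
# Minimal sketch of the abstract's pipeline; all names/sizes are assumptions,
# and nn.TransformerEncoder stands in for the Swin Transformer.
import torch
import torch.nn as nn


class VideoSegSketch(nn.Module):
    def __init__(self, in_ch=1, feat_ch=64, embed_dim=128, num_classes=2):
        super().__init__()
        # Per-frame CNN encoder, shared across the temporal window.
        self.cnn = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Temporal feature blender: learned fusion of neighbouring-frame
        # features into one spatio-temporal feature map.
        self.blend = nn.Conv3d(feat_ch, feat_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        # Tokenise the blended map and encode global context.
        self.proj = nn.Conv2d(feat_ch, embed_dim, kernel_size=1)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        # UNet-like upsampling decoder head (skip connections omitted).
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(embed_dim, feat_ch, 2, stride=2), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(feat_ch, feat_ch, 2, stride=2), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, num_classes, 1),
        )

    def forward(self, clip):                            # clip: (B, T, C, H, W)
        b, t, c, h, w = clip.shape
        f = self.cnn(clip.flatten(0, 1))                # (B*T, F, H/4, W/4)
        f = f.view(b, t, *f.shape[1:]).transpose(1, 2)  # (B, F, T, h, w)
        f = self.blend(f).mean(dim=2)                   # collapse time: (B, F, h, w)
        tokens = self.proj(f).flatten(2).transpose(1, 2)  # (B, h*w, E)
        tokens = self.transformer(tokens)               # global encoding
        f = tokens.transpose(1, 2).view(b, -1, h // 4, w // 4)
        return self.decoder(f)                          # (B, classes, H, W)


# Usage: segment the centre frame of a 3-frame temporal window.
model = VideoSegSketch()
logits = model(torch.randn(2, 3, 1, 64, 64))            # -> (2, 2, 64, 64)
```

The key design choice mirrored here is that temporal fusion happens in feature space, before tokenisation, so the transformer attends over a single spatio-temporally enriched map rather than over every frame independently.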