We propose Video-TransUNet, a deep architecture for instance segmentation in medical CT videos, constructed by integrating temporal feature blending into the TransUNet deep learning framework. In particular, our approach amalgamates strong frame representation via a ResNet CNN backbone, multi-frame feature blending via a Temporal Context Module (TCM), non-local attention via a Vision Transformer, and reconstructive capabilities for multiple targets via a UNet-based convolutional-deconvolutional architecture with multiple heads. We show that this new network design can significantly outperform other state-of-the-art systems when tested on the segmentation of bolus and pharynx/larynx in Videofluoroscopic Swallowing Study (VFSS) CT sequences. On our VFSS2022 dataset it achieves a dice coefficient of $0.8796$ and an average surface distance of $1.0379$ pixels. Note that tracking the pharyngeal bolus accurately is a particularly important application in clinical practice, since it constitutes the primary method for diagnosing swallowing impairment. Our findings suggest that the proposed model can indeed enhance the TransUNet architecture by exploiting temporal information, improving segmentation performance by a significant margin. We publish key source code, network weights, and ground truth annotations to simplify reproduction of our results.
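For reference, the Dice coefficient reported above measures the overlap between a predicted segmentation mask and its ground truth. A minimal sketch of the standard metric (not the paper's exact evaluation code; assumes binary NumPy masks) is:

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary masks; 1.0 means perfect overlap."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return float(2.0 * intersection / (pred.sum() + target.sum() + eps))

# Toy example: two 4x4 masks whose 3-pixel foregrounds overlap in 2 pixels.
a = np.zeros((4, 4)); a[0, :3] = 1
b = np.zeros((4, 4)); b[0, 1:4] = 1
print(round(dice_coefficient(a, b), 4))  # 2*2 / (3+3) ≈ 0.6667
```

In per-instance evaluation, such a score would be computed separately for each target class (e.g. bolus and pharynx/larynx) and then averaged across frames.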