We propose Video-TransUNet, a deep architecture for instance segmentation in medical CT videos, constructed by integrating temporal feature blending into the TransUNet deep learning framework. In particular, our approach combines strong per-frame representation via a ResNet CNN backbone, multi-frame feature blending via a Temporal Context Module (TCM), non-local attention via a Vision Transformer, and reconstruction of multiple targets via a UNet-based convolutional-deconvolutional architecture with multiple heads. We show that this new network design significantly outperforms other state-of-the-art systems when tested on the segmentation of bolus and pharynx/larynx in Videofluoroscopic Swallowing Study (VFSS) CT sequences. On our VFSS2022 dataset it achieves a Dice coefficient of 0.8796 and an average surface distance of 1.0379 pixels. Note that accurate tracking of the pharyngeal bolus is a particularly important application in clinical practice, since it constitutes the primary method for diagnosing swallowing impairment. Our findings suggest that the proposed model can indeed enhance the TransUNet architecture by exploiting temporal information, improving segmentation performance by a significant margin. We publish key source code, network weights, and ground-truth annotations to simplify reproduction of our results.
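The pipeline described above (ResNet backbone → Temporal Context Module → Vision Transformer → multi-head UNet decoder) can be illustrated as a shape-flow sketch. This is purely a hypothetical illustration of the tensor shapes passing between the four stages, under assumed dimensions (8-frame clips, 256×256 frames, 512-channel features, 16× spatial downsampling); none of the function names or sizes come from the authors' implementation.

```python
# Hypothetical shape-flow sketch of the Video-TransUNet stages named in the
# abstract. All module names, channel counts, and frame sizes are illustrative
# assumptions, not the authors' code; we trace shapes only, with no weights.

def resnet_backbone(frames):
    # (T, H, W) grayscale frames -> (T, C, H/16, W/16) feature maps
    t, h, w = frames
    return (t, 512, h // 16, w // 16)

def temporal_context_module(feats):
    # Blend features across the T frames; the shape is preserved while
    # temporal context is fused into each frame's feature map.
    return feats

def vit_encoder(feats):
    # Flatten the spatial grid into tokens for non-local self-attention:
    # (T, C, h, w) -> (T, h*w tokens, C-dim embeddings)
    t, c, h, w = feats
    return (t, h * w, c)

def unet_decoder(tokens, out_size, num_heads=2):
    # Convolutional-deconvolutional decoder with one head per target
    # (two heads here, matching the bolus and pharynx/larynx targets).
    t, n, c = tokens
    h, w = out_size
    return (t, num_heads, h, w)

frames = (8, 256, 256)  # assumed 8-frame VFSS clip of 256x256 images
feats = temporal_context_module(resnet_backbone(frames))
tokens = vit_encoder(feats)
masks = unet_decoder(tokens, out_size=(256, 256))
print(masks)  # one segmentation mask per frame per target head
```

Running the sketch shows the clip-level output: per-frame, per-target masks at full resolution, which is the multi-head reconstruction the abstract refers to.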