Activity recognition in surgical videos is a key research area for developing next-generation devices and workflow monitoring systems. Since surgeries are long procedures with highly variable durations, deep learning models for surgical videos often use a two-stage setup: a backbone feature extractor followed by a temporal sequence model. In this paper, we investigate many state-of-the-art backbones and temporal models to find architectures that yield the strongest performance for surgical activity recognition. We first benchmark the models' performance on a large-scale activity recognition dataset containing over 800 surgery videos captured in multiple clinical operating rooms. We further evaluate the models on two smaller public datasets, Cholec80 and Cataract-101, which contain only 80 and 101 videos respectively. We empirically find that a Swin-Transformer backbone combined with a BiGRU temporal model yields strong performance on both datasets. Finally, we investigate the adaptability of the model to new domains by fine-tuning the models on data from a new hospital and experimenting with a recent unsupervised domain adaptation approach.
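The following is a minimal sketch of the two-stage setup described above (a per-frame backbone followed by a temporal sequence model), assuming a PyTorch/torchvision environment; the class name, hidden size, and clip shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import swin_t, Swin_T_Weights


class SwinBiGRU(nn.Module):
    """Frame-level Swin-Transformer features fed to a bidirectional GRU (hypothetical sketch)."""

    def __init__(self, num_classes: int, hidden_size: int = 256):
        super().__init__()
        # Stage 1: per-frame feature extractor (ImageNet-pretrained Swin-T here; the
        # choice of pretraining is an assumption for this sketch).
        self.backbone = swin_t(weights=Swin_T_Weights.IMAGENET1K_V1)
        feat_dim = self.backbone.head.in_features   # 768 for Swin-T
        self.backbone.head = nn.Identity()          # drop the classification head
        # Stage 2: temporal model over the sequence of frame features.
        self.bigru = nn.GRU(feat_dim, hidden_size, batch_first=True,
                            bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, time, 3, H, W) -> per-frame activity logits
        b, t = clips.shape[:2]
        feats = self.backbone(clips.flatten(0, 1))   # (b*t, feat_dim)
        feats = feats.view(b, t, -1)
        out, _ = self.bigru(feats)                   # (b, t, 2*hidden_size)
        return self.classifier(out)                  # (b, t, num_classes)


if __name__ == "__main__":
    model = SwinBiGRU(num_classes=7)                 # e.g. 7 phases in Cholec80
    logits = model(torch.randn(2, 16, 3, 224, 224))  # 2 clips of 16 frames each
    print(logits.shape)                              # torch.Size([2, 16, 7])
```

In practice the backbone is typically trained (or fine-tuned) on individual frames first, after which frame features are extracted offline and the temporal model is trained on full-length feature sequences, since whole surgeries are too long to fit end-to-end in GPU memory.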