Automatic recognition of fine-grained surgical activities, called steps, is a challenging but crucial task for intelligent intra-operative computer assistance. Current vision-based activity recognition methods rely heavily on large volumes of manually annotated data, which is difficult and time-consuming to generate and requires domain-specific expertise. In this work, we propose to use coarser and easier-to-annotate activity labels, namely phases, as weak supervision to learn step recognition from fewer step-annotated videos. We introduce a step-phase dependency loss to exploit the weak supervision signal. We then employ a Single-Stage Temporal Convolutional Network (SS-TCN) with a ResNet-50 backbone, trained end-to-end from weakly annotated videos, for temporal activity segmentation and recognition. We extensively evaluate the proposed method and demonstrate its effectiveness on a large video dataset of 40 laparoscopic gastric bypass procedures and on the public CATARACTS benchmark of 50 cataract surgeries.
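For illustration, here is a minimal PyTorch sketch of one plausible form of the step-phase dependency loss, assuming each step belongs to exactly one known phase: per-frame step probabilities are marginalized into phase probabilities through a fixed membership matrix and penalized against the phase annotations. The function name and the `step_to_phase` matrix are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def step_phase_dependency_loss(step_logits, phase_labels, step_to_phase):
    """Weak-supervision loss exploiting the step -> phase hierarchy.

    step_logits:   (T, n_steps) per-frame step scores
    phase_labels:  (T,) ground-truth phase index per frame
    step_to_phase: (n_steps, n_phases) float matrix with a single 1 per
                   row, marking the phase each step belongs to
    """
    step_probs = F.softmax(step_logits, dim=-1)      # (T, n_steps)
    # Marginalize step probabilities into a phase distribution; rows of
    # step_to_phase are one-hot, so phase_probs sums to 1 per frame.
    phase_probs = step_probs @ step_to_phase         # (T, n_phases)
    # Negative log-likelihood of the annotated phase under the induced
    # phase distribution (epsilon guards against log(0)).
    return F.nll_loss(torch.log(phase_probs + 1e-8), phase_labels)
```

The membership matrix would be built once from the procedure's known step-phase hierarchy, so videos carrying only phase labels can contribute a training signal without any step annotations.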
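Likewise, a hedged sketch of the SS-TCN head, following the single-stage variant of the temporal convolutional network of Farha and Gall (MS-TCN), applied on top of per-frame ResNet-50 features; the layer count, channel width, and backbone coupling below are assumptions rather than the paper's reported configuration.

```python
import torch
import torch.nn as nn

class DilatedResidualLayer(nn.Module):
    """Dilated temporal convolution (kernel 3) with a residual connection."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.conv_dilated = nn.Conv1d(channels, channels, kernel_size=3,
                                      padding=dilation, dilation=dilation)
        self.conv_1x1 = nn.Conv1d(channels, channels, kernel_size=1)
        self.dropout = nn.Dropout()

    def forward(self, x):
        out = torch.relu(self.conv_dilated(x))
        out = self.dropout(self.conv_1x1(out))
        return x + out

class SSTCN(nn.Module):
    """Single-stage TCN: 1x1 input projection, stacked dilated residual
    layers (dilation doubled per layer), and a per-frame classifier."""
    def __init__(self, in_dim=2048, channels=64, n_classes=10, n_layers=10):
        super().__init__()
        self.conv_in = nn.Conv1d(in_dim, channels, kernel_size=1)
        self.layers = nn.ModuleList(
            [DilatedResidualLayer(channels, 2 ** i) for i in range(n_layers)])
        self.conv_out = nn.Conv1d(channels, n_classes, kernel_size=1)

    def forward(self, x):           # x: (B, in_dim, T) frame features
        out = self.conv_in(x)
        for layer in self.layers:
            out = layer(out)
        return self.conv_out(out)   # (B, n_classes, T) per-frame logits
```

In end-to-end training, the ResNet-50 backbone would produce the (B, 2048, T) feature sequence, and the per-frame step logits above would feed both a standard step cross-entropy (where step labels exist) and the dependency loss sketched earlier.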