Most benchmarks for studying surgical interventions focus on a specific challenge instead of leveraging the intrinsic complementarity among different tasks. In this work, we present a new experimental framework towards holistic surgical scene understanding. First, we introduce the Phase, Step, Instrument, and Atomic Visual Action recognition (PSI-AVA) Dataset. PSI-AVA includes annotations for both long-term reasoning (Phase and Step recognition) and short-term reasoning (Instrument detection and novel Atomic Action recognition) in robot-assisted radical prostatectomy videos. Second, we present Transformers for Action, Phase, Instrument, and steps Recognition (TAPIR) as a strong baseline for surgical scene understanding. TAPIR leverages our dataset's multi-level annotations, benefiting from the representation learned on the instrument detection task to improve its classification capacity. Our experimental results on both PSI-AVA and other publicly available datasets demonstrate the adequacy of our framework to spur future research on holistic surgical scene understanding.
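To make the multi-level idea concrete, below is a minimal PyTorch sketch of how classification heads can share a representation across the four tasks: long-term tasks (phase, step) read a clip-level embedding, while short-term tasks (instrument, atomic action) fuse that embedding with per-box detector features. The `TAPIRHeads` name, the concatenation-based fusion, the feature dimensions, and the default class counts are all illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TAPIRHeads(nn.Module):
    """Illustrative multi-level classification heads in the spirit of TAPIR.

    Assumes some video backbone yields a clip-level embedding and some
    instrument detector yields per-box region features; neither component
    nor the fusion below reproduces the authors' exact design.
    """

    def __init__(self, video_dim=768, box_dim=256,
                 n_phases=11, n_steps=21, n_instruments=7, n_actions=16):
        super().__init__()
        # Class counts above are illustrative placeholders, not the
        # definitive PSI-AVA task vocabularies.
        # Long-term tasks read the clip-level embedding directly.
        self.phase_head = nn.Linear(video_dim, n_phases)
        self.step_head = nn.Linear(video_dim, n_steps)
        # Short-term tasks fuse detector box features with clip context.
        self.fuse = nn.Linear(video_dim + box_dim, box_dim)
        self.instrument_head = nn.Linear(box_dim, n_instruments)
        self.action_head = nn.Linear(box_dim, n_actions)

    def forward(self, clip_feat, box_feats):
        # clip_feat: (B, video_dim); box_feats: (B, N_boxes, box_dim)
        phase_logits = self.phase_head(clip_feat)
        step_logits = self.step_head(clip_feat)
        # Broadcast the clip context over every detected box, then fuse.
        ctx = clip_feat.unsqueeze(1).expand(-1, box_feats.size(1), -1)
        fused = torch.relu(self.fuse(torch.cat([ctx, box_feats], dim=-1)))
        return (phase_logits, step_logits,
                self.instrument_head(fused), self.action_head(fused))

if __name__ == "__main__":
    heads = TAPIRHeads()
    outs = heads(torch.randn(2, 768), torch.randn(2, 5, 256))
    print([t.shape for t in outs])  # clip-level and per-box logits
```

One design point the sketch tries to mirror: the short-term heads see both the detector's region features and the global clip context, which is how a representation learned for instrument detection can also sharpen the other classification tasks.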