One of the recent advances in surgical AI is the recognition of surgical activities as triplets of (instrument, verb, target). While this provides detailed information for computer-assisted intervention, current triplet recognition approaches rely only on single-frame features. Exploiting temporal cues from earlier frames would improve the recognition of surgical action triplets from videos. In this paper, we propose Rendezvous in Time (RiT), a deep learning model that extends the state-of-the-art model, Rendezvous, with temporal modeling. Focusing on the verbs, RiT explores the connectedness of current and past frames to learn temporal attention-based features for enhanced triplet recognition. We validate our proposal on the challenging surgical triplet dataset, CholecT45, demonstrating improved recognition of the verb and the triplet, along with other verb-related interactions such as (instrument, verb). Qualitative results show that RiT produces smoother predictions than the state of the art for most triplet instances. We present a novel attention-based approach that leverages the temporal fusion of video frames to model the evolution of surgical actions and exploit it for surgical triplet recognition.
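To make the temporal-fusion idea concrete, below is a minimal sketch (not the authors' RiT implementation) of attention-based fusion in which the current frame's features attend over a short window of past-frame features before triplet classification. The class name, feature dimension, window length, and the number of triplet classes (100 for CholecT45) are illustrative assumptions.

```python
# Illustrative sketch of attention-based temporal fusion for triplet recognition.
# Not the authors' implementation; names and dimensions are assumptions.
import torch
import torch.nn as nn


class TemporalFusionHead(nn.Module):
    def __init__(self, feat_dim: int = 512, num_triplets: int = 100, num_heads: int = 8):
        super().__init__()
        # The current-frame feature is the query; past-frame features are keys/values.
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)
        self.classifier = nn.Linear(feat_dim, num_triplets)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, T, feat_dim), with the current frame at index -1.
        query = frame_feats[:, -1:, :]                  # (batch, 1, feat_dim)
        fused, _ = self.attn(query, frame_feats, frame_feats)
        fused = self.norm(fused + query)                # residual connection
        return self.classifier(fused.squeeze(1))        # (batch, num_triplets) logits


# Usage: per-frame features for 4 past frames plus the current frame,
# e.g. produced by a shared CNN backbone.
feats = torch.randn(2, 5, 512)
logits = TemporalFusionHead()(feats)
print(logits.shape)  # torch.Size([2, 100])
```

The design choice illustrated here is that temporal context is injected only through attention over past frames, so the per-frame backbone remains unchanged and the temporal head can be trained on top of single-frame features.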