Real-time and accurate instrument segmentation from videos is of great significance for improving the performance of robot-assisted surgery. We identify two important clues for surgical instrument perception: the local temporal dependency among adjacent frames and the global semantic correlation over a long temporal range. However, most existing works perform segmentation using only the visual cues of a single frame, while optical-flow-based methods model the motion between just two frames and incur heavy computational cost. We propose a novel dual-memory network (DMNet) that relates both global and local spatio-temporal knowledge to augment the features of the current frame, boosting segmentation performance while retaining real-time prediction capability. On the one hand, we design an efficient local memory that combines the complementary advantages of convolutional LSTM and non-local mechanisms with respect to the receptive field of temporal aggregation. On the other hand, we develop an active global memory that gathers long-range global semantic correlation for the current frame, selecting the most informative frames according to model uncertainty and frame similarity. We have extensively validated our method on two public benchmark surgical video datasets. Experimental results demonstrate that our method largely outperforms state-of-the-art works in segmentation accuracy while maintaining real-time speed.
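The abstract does not specify how the active global memory scores candidate frames; a minimal illustrative sketch, assuming uncertainty is measured by mean pixel-wise entropy of softmax scores and similarity by cosine distance between frame features (both hypothetical choices, not the paper's confirmed formulation), could look like this:

```python
import numpy as np

def mean_entropy(probs):
    # Mean pixel-wise entropy of per-class softmax scores (axis 0 = classes),
    # used here as a proxy for model uncertainty on a frame.
    return float(-(probs * np.log(probs + 1e-8)).sum(axis=0).mean())

def cosine(a, b):
    # Cosine similarity between two flattened frame feature maps.
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def select_informative_frames(feats, probs, current_feat, k=2):
    # Score each candidate frame: prefer high predictive uncertainty and
    # penalise frames that are redundant with (similar to) the current one.
    scores = [mean_entropy(p) - cosine(f, current_feat)
              for f, p in zip(feats, probs)]
    # Return indices of the k highest-scoring frames for the global memory.
    return sorted(range(len(feats)), key=scores.__getitem__, reverse=True)[:k]
```

Under this scoring, a frame identical to the current one is ranked low even if confidently predicted, while uncertain and dissimilar frames are favoured for the memory bank.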