Purpose: Surgical scene understanding plays a critical role in the technology stack of tomorrow's intervention-assisting systems for endoscopic surgery. Tracking the endoscope pose is a key component of this, but remains challenging due to illumination conditions, deforming tissue, and the breathing motion of organs.

Method: We propose a solution for stereo endoscopes that estimates depth and optical flow, and minimizes two geometric losses for camera pose estimation. Most importantly, we introduce two learned adaptive per-pixel weight mappings that balance the contributions of the losses according to the input image content. To do so, we train a Deep Declarative Network that combines the expressiveness of deep learning with the robustness of a novel geometry-based optimization approach. We validate our approach on the publicly available SCARED dataset and introduce a new in-vivo dataset, StereoMIS, which covers a wider spectrum of typically observed surgical settings.

Results: Our method outperforms state-of-the-art methods on average and, more importantly, in difficult scenarios where tissue deformations and breathing motion are visible. We observe that our proposed weight mappings attenuate the contribution of pixels in ambiguous regions of the images, such as deforming tissue.

Conclusion: We demonstrate the effectiveness of our solution for robustly estimating the camera pose in challenging endoscopic surgical scenes. Our contributions can be used to improve related tasks such as simultaneous localization and mapping (SLAM) or 3D reconstruction, thereby advancing surgical scene understanding in minimally invasive surgery.
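To make the core idea concrete, the following is a minimal sketch of a per-pixel-weighted geometric loss for pose estimation, not the paper's actual formulation: back-projected 3D points from one frame are aligned to the next frame under a candidate rotation and translation, and a per-point weight (learned per pixel in the paper, uniform here) scales each residual so that ambiguous regions can be down-weighted. All function and variable names are hypothetical.

```python
import numpy as np

def weighted_geometric_loss(P_ref, P_tgt, w, R, t):
    """Per-point-weighted 3D alignment residual (hypothetical simplification
    of the paper's geometric losses). Points P_ref transformed by the
    candidate pose (R, t) should coincide with P_tgt; the weights w play the
    role of the learned per-pixel mappings that attenuate ambiguous regions
    such as deforming tissue."""
    residuals = (R @ P_ref.T).T + t - P_tgt        # (N, 3) alignment errors
    return float(np.sum(w * np.linalg.norm(residuals, axis=1) ** 2))

# Synthetic example: a known small camera motion between two stereo frames.
rng = np.random.default_rng(0)
P_ref = rng.uniform(-1.0, 1.0, size=(100, 3))      # back-projected depth points
theta = 0.05                                       # small in-plane rotation
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([0.01, 0.0, 0.02])
P_tgt = (R_true @ P_ref.T).T + t_true              # points seen from new pose

w = np.ones(len(P_ref))                            # uniform weights in this toy;
                                                   # the paper learns them per pixel
loss_true = weighted_geometric_loss(P_ref, P_tgt, w, R_true, t_true)
loss_identity = weighted_geometric_loss(P_ref, P_tgt, w, np.eye(3), np.zeros(3))
```

In an actual solver, the candidate pose would be iteratively refined to minimize this loss; here the true pose yields a (numerically) zero loss while the identity pose does not, which is the signal the optimization exploits.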