Pixel-wise segmentation of laparoscopic scenes is essential for computer-assisted surgery but difficult to scale because dense annotations are costly to obtain. We propose depth-guided surgical scene segmentation (DepSeg), a training-free framework that uses monocular depth as a geometric prior together with pretrained vision foundation models. DepSeg first estimates a relative depth map with a pretrained monocular depth estimation network and derives depth-guided point prompts, which SAM2 converts into class-agnostic masks. Each mask is then described by a pooled feature from a pretrained visual encoder and classified via template matching against a template bank built from annotated frames. On the CholecSeg8k dataset, DepSeg improves over a direct SAM2 automatic segmentation baseline (35.9% vs. 14.7% mIoU) and maintains competitive performance even when using only 10--20% of the object templates. These results show that depth-guided prompting and template-based classification offer an annotation-efficient segmentation approach.
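The sketch below is a minimal, illustrative reading of the pipeline summarized above, not the authors' implementation: the depth-binned point sampling, the function names, and the callable placeholders for the depth model, the SAM2 predictor, and the feature extractor are all assumptions standing in for the pretrained components named in the abstract.

```python
import numpy as np

def propose_depth_points(depth, n_bins=8, points_per_bin=2):
    """Depth-guided point proposal (assumed strategy): quantize the relative
    depth map into bins and sample a few pixels per bin, so objects at
    different depth layers each receive at least one prompt."""
    edges = np.quantile(depth, np.linspace(0.0, 1.0, n_bins + 1))
    points = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        ys, xs = np.nonzero((depth >= lo) & (depth <= hi))
        if len(ys) == 0:
            continue
        idx = np.random.choice(len(ys), size=min(points_per_bin, len(ys)), replace=False)
        points.extend((int(xs[i]), int(ys[i])) for i in idx)
    return np.asarray(points)  # (N, 2) point prompts in (x, y) pixel coordinates

def classify_masks(masks, feature_map, template_bank):
    """Template matching: average-pool the dense feature map inside each mask
    and assign the class of the most similar (cosine) template vector."""
    labels = []
    for mask in masks:                                   # mask: (H, W) boolean array
        feat = feature_map[mask].mean(axis=0)            # pooled descriptor (D,)
        feat = feat / (np.linalg.norm(feat) + 1e-8)
        best_cls, best_sim = None, -1.0
        for cls, templates in template_bank.items():     # templates: (K, D), L2-normalized
            sim = float(np.max(templates @ feat))
            if sim > best_sim:
                best_cls, best_sim = cls, sim
        labels.append(best_cls)
    return labels

def depseg(image, depth_model, sam2_predictor, feature_extractor, template_bank):
    """End-to-end sketch: depth -> point prompts -> SAM2 masks -> template matching.
    `depth_model`, `sam2_predictor`, and `feature_extractor` are placeholder
    callables for the pretrained networks; their exact interfaces are assumptions."""
    depth = depth_model(image)                                  # (H, W) relative depth
    points = propose_depth_points(depth)
    masks = [sam2_predictor(image, point=p) for p in points]    # class-agnostic (H, W) masks
    feature_map = feature_extractor(image)                      # (H, W, D) dense features
    return masks, classify_masks(masks, feature_map, template_bank)
```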