Pixel-wise segmentation of laparoscopic scenes is essential for computer-assisted surgery but difficult to scale because dense annotations are costly to obtain. We propose depth-guided surgical scene segmentation (DepSeg), a training-free framework that uses monocular depth as a geometric prior together with pretrained vision foundation models. DepSeg first estimates a relative depth map with a pretrained monocular depth estimation network and derives depth-guided point prompts, which SAM2 converts into class-agnostic masks. Each mask is then described by a pooled feature from a pretrained visual encoder and classified via template matching against a template bank built from annotated frames. On the CholecSeg8k dataset, DepSeg improves over a direct SAM2 automatic segmentation baseline (35.9% vs. 14.7% mIoU) and maintains competitive performance even when using only 10--20% of the object templates. These results show that depth-guided prompting and template-based classification offer an annotation-efficient segmentation approach.
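The sketch below is a minimal, illustrative reading of the pipeline summarized above, not the authors' implementation: the depth-binned point sampling, the function names, and the callable placeholders for the depth model, the SAM2 predictor, and the feature extractor are all assumptions standing in for the pretrained components named in the abstract.

```python
import numpy as np

def propose_depth_points(depth, n_bins=8, points_per_bin=2):
    """Depth-guided point proposal (assumed strategy): quantize the relative
    depth map into bins and sample a few pixels per bin, so objects at
    different depth layers each receive at least one prompt."""
    edges = np.quantile(depth, np.linspace(0.0, 1.0, n_bins + 1))
    points = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        ys, xs = np.nonzero((depth >= lo) & (depth <= hi))
        if len(ys) == 0:
            continue
        idx = np.random.choice(len(ys), size=min(points_per_bin, len(ys)), replace=False)
        points.extend((int(xs[i]), int(ys[i])) for i in idx)
    return np.asarray(points)  # (N, 2) point prompts in (x, y) pixel coordinates

def classify_masks(masks, feature_map, template_bank):
    """Template matching: average-pool the dense feature map inside each mask
    and assign the class of the most similar (cosine) template vector."""
    labels = []
    for mask in masks:                                   # mask: (H, W) boolean array
        feat = feature_map[mask].mean(axis=0)            # pooled descriptor (D,)
        feat = feat / (np.linalg.norm(feat) + 1e-8)
        best_cls, best_sim = None, -1.0
        for cls, templates in template_bank.items():     # templates: (K, D), L2-normalized
            sim = float(np.max(templates @ feat))
            if sim > best_sim:
                best_cls, best_sim = cls, sim
        labels.append(best_cls)
    return labels

def depseg(image, depth_model, sam2_predictor, feature_extractor, template_bank):
    """End-to-end sketch: depth -> point prompts -> SAM2 masks -> template matching.
    `depth_model`, `sam2_predictor`, and `feature_extractor` are placeholder
    callables for the pretrained networks; their exact interfaces are assumptions."""
    depth = depth_model(image)                                  # (H, W) relative depth
    points = propose_depth_points(depth)
    masks = [sam2_predictor(image, point=p) for p in points]    # class-agnostic (H, W) masks
    feature_map = feature_extractor(image)                      # (H, W, D) dense features
    return masks, classify_masks(masks, feature_map, template_bank)
```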