We present the first comprehensive video polyp segmentation (VPS) study in the deep learning era. For years, progress in VPS has been hindered by the lack of large-scale, fine-grained segmentation annotations. To address this issue, we first introduce a high-quality, frame-by-frame annotated VPS dataset, named SUN-SEG, which contains 158,690 colonoscopy frames from the well-known SUN-database. We provide additional annotations of diverse types, i.e., attribute, object mask, boundary, scribble, and polygon. Second, we design a simple but efficient baseline, dubbed PNS+, consisting of a global encoder, a local encoder, and normalized self-attention (NS) blocks. The global and local encoders receive an anchor frame and multiple successive frames to extract long-term and short-term spatio-temporal representations, which are then progressively refined by two NS blocks. Extensive experiments show that PNS+ achieves the best performance with real-time inference speed (170 fps), making it a promising solution for the VPS task. Third, we extensively evaluate 13 representative polyp/object segmentation models on our SUN-SEG dataset and provide attribute-based comparisons. Finally, we discuss several open issues and suggest possible research directions for the VPS community.
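To make the role of the NS blocks concrete, the sketch below shows a generic normalized (softmax-weighted) self-attention update over per-frame feature vectors. This is an illustrative toy in NumPy, not the paper's exact NS formulation: the function name, the shared query/key/value projections, and the feature shapes are all assumptions for the sake of a self-contained example.

```python
import numpy as np

def normalized_self_attention(x, eps=1e-6):
    """Toy normalized self-attention over a sequence of frame features.

    x: (T, C) array, one C-dimensional feature vector per frame.
    Returns updated (T, C) features, where each frame's feature is a
    normalized weighted sum over all frames. Illustrative only; the
    paper's NS block operates on spatial-temporal feature maps.
    """
    q, k, v = x, x, x                                   # shared projections for simplicity
    scores = q @ k.T / np.sqrt(x.shape[1])              # (T, T) scaled affinities
    scores = scores - scores.max(axis=1, keepdims=True) # stabilize the softmax
    attn = np.exp(scores)
    attn = attn / (attn.sum(axis=1, keepdims=True) + eps)  # row-normalized weights
    return attn @ v                                     # (T, C) updated features
```

In the paper's setting, one such block attends from the anchor frame's long-term features (global encoder) and another from the successive frames' short-term features (local encoder), and their outputs are progressively fused.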