Precise localization of polyps is crucial for early cancer screening in gastrointestinal endoscopy. Videos produced by endoscopy provide richer contextual information than still images, but also bring more challenges. The camera-moving situation, instead of the common camera-fixed-object-moving one, leads to significant background variation between frames. Severe internal artifacts (e.g., water flow in the human body, specular reflection by tissues) can make the quality of adjacent frames vary considerably. These factors hinder a video-based model from effectively aggregating features from neighboring frames and giving better predictions. In this paper, we present Spatial-Temporal Feature Transformation (STFT), a multi-frame collaborative framework to address these issues. Spatially, STFT mitigates inter-frame variations in the camera-moving situation through feature alignment with proposal-guided deformable convolutions. Temporally, STFT proposes a channel-aware attention module to simultaneously estimate the quality and correlation of adjacent frames for adaptive feature aggregation. Empirical studies and superior results demonstrate the effectiveness and stability of our method. For example, STFT improves the still-image baseline FCOS by 10.6% and 20.6% on the comprehensive F1-score of the polyp localization task in the CVC-Clinic and ASUMayo datasets, respectively, and outperforms the state-of-the-art video-based method by 3.6% and 8.0%, respectively. Code is available at \url{https://github.com/lingyunwu14/STFT}.
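The temporal aggregation idea above, weighting adjacent-frame features channel by channel according to their estimated correlation with the current frame, can be sketched as follows. This is a minimal NumPy illustration of the general principle, not the paper's implementation: the per-channel cosine-similarity correlation measure, the softmax weighting, and the function name are all assumptions for exposition.

```python
import numpy as np

def channel_aware_aggregate(ref_feat, nbr_feats):
    """Adaptively aggregate aligned neighbor-frame features.

    ref_feat:  (C, H, W) feature map of the current frame
    nbr_feats: (T, C, H, W) feature maps of T adjacent frames,
               assumed already spatially aligned to the current frame

    Sketch (not the paper's exact module): for each channel, score
    every neighbor frame by the cosine similarity of its channel map
    with the reference channel map, softmax the scores across frames,
    and blend the neighbor channels with those per-channel weights.
    """
    T, C, H, W = nbr_feats.shape
    ref = ref_feat.reshape(C, -1)                 # (C, H*W)
    nbr = nbr_feats.reshape(T, C, -1)             # (T, C, H*W)

    # Per-channel cosine similarity between each neighbor and the reference.
    num = (nbr * ref[None]).sum(-1)               # (T, C)
    den = np.linalg.norm(nbr, axis=-1) * np.linalg.norm(ref, axis=-1)[None]
    sim = num / (den + 1e-8)                      # (T, C)

    # Softmax over the frame axis -> adaptive per-channel weights.
    w = np.exp(sim - sim.max(axis=0, keepdims=True))
    w /= w.sum(axis=0, keepdims=True)             # (T, C)

    # Channel-wise weighted sum of neighbor features.
    return (w[..., None, None] * nbr_feats).sum(0)  # (C, H, W)
```

Low-quality neighbors (weak correlation with the current frame) receive small weights in every channel they disagree on, so they contribute little to the aggregated feature.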