While Video Instance Segmentation (VIS) has seen rapid progress, current approaches struggle to predict high-quality masks with accurate boundary details. Moreover, the predicted segmentations often fluctuate over time, suggesting that temporal consistency cues are neglected or not fully utilized. In this paper, we set out to tackle these issues, with the aim of achieving highly detailed and more temporally stable mask predictions for VIS. We first propose the Video Mask Transfiner (VMT) method, capable of leveraging fine-grained high-resolution features thanks to a highly efficient video transformer structure. Our VMT detects and groups sparse error-prone spatio-temporal regions of each tracklet in the video segment, which are then refined using both local and instance-level cues. Second, we identify that the coarse boundary annotations of the popular YouTube-VIS dataset constitute a major limiting factor. Based on our VMT architecture, we therefore design an automated annotation refinement approach based on iterative training and self-correction. To benchmark high-quality mask predictions for VIS, we introduce the HQ-YTVIS dataset, consisting of a manually re-annotated test set and our automatically refined training data. We compare VMT with the most recent state-of-the-art methods on the HQ-YTVIS, YouTube-VIS, OVIS and BDD100K MOTS benchmarks. Experimental results clearly demonstrate the effectiveness of our method in segmenting complex and dynamic objects by capturing precise details.
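To make the refinement idea concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of selecting sparse error-prone spatio-temporal points of a tracklet's coarse masks and re-predicting only those points from high-resolution local features plus an instance-level embedding. All function names, shapes, and the uncertainty heuristic (logits near the decision boundary) are assumptions made for illustration.

```python
# Hypothetical sketch: sparse error-prone point selection + pointwise refinement.
import torch


def select_error_prone_points(mask_logits: torch.Tensor, num_points: int) -> torch.Tensor:
    """mask_logits: (T, H, W) per-frame logits of one tracklet.
    Returns (num_points, 3) indices of the most uncertain spatio-temporal points."""
    uncertainty = -mask_logits.abs()                  # logits near 0 -> most uncertain
    _, idx = uncertainty.flatten().topk(num_points)
    t, h, w = mask_logits.shape
    return torch.stack((idx // (h * w), (idx % (h * w)) // w, idx % w), dim=1)


def refine_points(points: torch.Tensor,
                  fine_features: torch.Tensor,        # (T, C, H, W) high-res features
                  instance_embed: torch.Tensor,       # (C,) instance-level cue
                  head: torch.nn.Module) -> torch.Tensor:
    """Re-predict mask logits only at the selected sparse points."""
    local = fine_features[points[:, 0], :, points[:, 1], points[:, 2]]   # (N, C)
    fused = torch.cat((local, instance_embed.expand(local.size(0), -1)), dim=1)
    return head(fused).squeeze(-1)                    # (N,) refined logits


if __name__ == "__main__":
    T, C, H, W, N = 4, 16, 64, 64, 128
    coarse = torch.randn(T, H, W)                     # coarse per-frame mask logits
    feats = torch.randn(T, C, H, W)                   # stand-in fine-grained features
    inst = torch.randn(C)                             # stand-in instance embedding
    head = torch.nn.Sequential(torch.nn.Linear(2 * C, C),
                               torch.nn.ReLU(),
                               torch.nn.Linear(C, 1))

    pts = select_error_prone_points(coarse, N)
    refined = refine_points(pts, feats, inst, head)
    coarse[pts[:, 0], pts[:, 1], pts[:, 2]] = refined  # paste refined logits back
```

The sketch only conveys the flow described in the abstract: coarse masks stay untouched except at a sparse set of uncertain spatio-temporal locations, which are re-predicted from local and instance-level cues; the actual VMT uses a video transformer over these grouped regions rather than a pointwise MLP.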