Video instance segmentation aims to predict object segmentation masks in each frame and to associate instances across frames. Recent end-to-end video instance segmentation methods perform object segmentation and instance association jointly in a direct parallel sequence decoding/prediction framework. Although these methods generally predict higher-quality object segmentation masks, they can fail to associate instances in challenging cases because they do not explicitly model temporal instance consistency across adjacent frames. We propose a consistent end-to-end video instance segmentation framework with Inter-Frame Recurrent Attention, which models both the temporal instance consistency of adjacent frames and the global temporal context. Our extensive experiments demonstrate that Inter-Frame Recurrent Attention significantly improves temporal instance consistency while maintaining the quality of the object segmentation masks. Our model achieves state-of-the-art accuracy on both the YouTubeVIS-2019 (62.1\%) and YouTubeVIS-2021 (54.7\%) datasets. In addition, quantitative and qualitative results show that the proposed method predicts more temporally consistent instance segmentation masks.
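To make the idea concrete, below is a minimal sketch of how inter-frame recurrent attention might carry instance queries from one frame to the next, so that the same query slot keeps tracking the same instance. All module names, tensor shapes, and the GRU-based recurrent update are assumptions chosen for illustration; the paper's actual architecture (head counts, normalization, how global temporal context is injected) may differ.

```python
# Illustrative sketch only: a recurrent cross-attention update over frames.
import torch
import torch.nn as nn


class InterFrameRecurrentAttention(nn.Module):
    """Propagates instance queries from frame t-1 to frame t.

    Each query attends to the current frame's features, then a recurrent
    cell fuses the update with the previous-frame query state, which is
    one plausible way to encode temporal instance consistency.
    """

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Previous-frame queries attend to the current frame's features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Recurrent fusion keeps each query slot bound to one instance.
        self.gru = nn.GRUCell(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, prev_queries: torch.Tensor, frame_feats: torch.Tensor) -> torch.Tensor:
        # prev_queries: (B, N, D) instance queries from frame t-1
        # frame_feats:  (B, HW, D) flattened backbone features of frame t
        update, _ = self.cross_attn(prev_queries, frame_feats, frame_feats)
        B, N, D = prev_queries.shape
        # GRUCell expects 2-D inputs, so fold the instance axis into the batch.
        fused = self.gru(update.reshape(B * N, D), prev_queries.reshape(B * N, D))
        return self.norm(fused.reshape(B, N, D))


# Usage: unroll over a clip so each frame's queries are seeded by the last.
if __name__ == "__main__":
    B, T, N, HW, D = 2, 5, 10, 64, 256
    layer = InterFrameRecurrentAttention(D)
    queries = torch.zeros(B, N, D)            # learned initial queries in practice
    clip = torch.randn(B, T, HW, D)           # per-frame backbone features
    for t in range(T):
        queries = layer(queries, clip[:, t])  # same slot -> same instance over time
    print(queries.shape)  # torch.Size([2, 10, 256])
```

The design choice this sketch highlights is that association becomes implicit: because query slot i at frame t is initialized from slot i at frame t-1, no separate matching step between frames is needed.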