Video Instance Segmentation (VIS) is a task that simultaneously requires classification, segmentation, and instance association in a video. Recent VIS approaches rely on sophisticated pipelines to achieve this goal, including RoI-related operations or 3D convolutions. In contrast, we present a simple and efficient single-stage VIS framework based on the instance segmentation method CondInst by adding an extra tracking head. To improve instance association accuracy, we propose a novel bi-directional spatio-temporal contrastive learning strategy for tracking embeddings across frames. Moreover, an instance-wise temporal consistency scheme is utilized to produce temporally coherent results. Experiments conducted on the YouTube-VIS-2019, YouTube-VIS-2021, and OVIS-2021 datasets validate the effectiveness and efficiency of the proposed method. We hope the proposed framework can serve as a simple and strong alternative for many other instance-level video association tasks. Code will be made available.
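To make the bi-directional contrastive association concrete, the following is a minimal, hypothetical sketch of a symmetric contrastive loss over per-instance tracking embeddings from two frames. It is not the paper's exact formulation: the function name, the cosine-similarity/softmax form, and the temperature `tau` are illustrative assumptions; it only conveys the idea that each instance's embedding should match its own identity in the other frame, in both directions.

```python
import numpy as np

def contrastive_assoc_loss(emb_a, emb_b, ids_a, ids_b, tau=0.1):
    """Hypothetical symmetric (bi-directional) contrastive loss between
    per-instance embeddings of two frames. emb_*: (N, D) arrays of
    instance embeddings; ids_*: (N,) arrays of ground-truth instance ids."""
    def one_way(q, k, q_ids, k_ids):
        # cosine similarity between every query and every candidate
        qn = q / np.linalg.norm(q, axis=1, keepdims=True)
        kn = k / np.linalg.norm(k, axis=1, keepdims=True)
        sim = qn @ kn.T / tau
        # softmax over candidates in the other frame (numerically stable)
        sim = sim - sim.max(axis=1, keepdims=True)
        p = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)
        # positives are candidates carrying the same instance id
        pos = q_ids[:, None] == k_ids[None, :]
        # negative log-likelihood of the correct match, averaged over queries
        return -np.log((p * pos).sum(axis=1) + 1e-12).mean()
    # "bi-directional": match frame a -> b and frame b -> a, then average
    return 0.5 * (one_way(emb_a, emb_b, ids_a, ids_b)
                  + one_way(emb_b, emb_a, ids_b, ids_a))
```

In this sketch, well-separated embeddings for the same instance across frames drive the loss toward zero, while confusable embeddings between different instances raise it, which is the behavior the tracking head is trained to produce.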