Video instance segmentation (VIS) is a new and critical task in computer vision. To date, the top-performing VIS methods extend the two-stage Mask R-CNN by adding a tracking branch, leaving plenty of room for improvement. In contrast, we approach the VIS task from a new perspective and propose a one-stage spatial granularity network (SG-Net). Compared to conventional two-stage methods, SG-Net offers four advantages: 1) Our method has a compact one-stage architecture in which each task head (detection, segmentation, and tracking) is crafted interdependently, so the heads can effectively share features and benefit from joint optimization; 2) Our mask prediction is performed dynamically on the sub-regions of each detected instance, leading to high-quality masks of fine granularity; 3) Each of our task predictions avoids using expensive proposal-based RoI features, resulting in much reduced runtime complexity per instance; 4) Our tracking head models each instance's centerness movement across frames, which effectively enhances tracking robustness to varying object appearances. In evaluation, we present state-of-the-art comparisons on the YouTube-VIS dataset. Extensive experiments demonstrate that our compact one-stage method achieves improved performance in both accuracy and inference speed. We hope SG-Net can serve as a strong and flexible baseline for the VIS task. Our code will be made available.
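The abstract names centerness-movement modeling as the tracking mechanism but gives no implementation details. Purely as an illustration (not the authors' code), the minimal sketch below shows one way predicted center movements could drive cross-frame instance association; the tensor layout, the greedy matching scheme, and the `max_dist` gate are all assumptions made for this example.

```python
import torch

def associate_by_center_motion(prev_centers, curr_centers, pred_offsets,
                               max_dist=32.0):
    """Greedy cross-frame association via predicted center movement.

    prev_centers : (M, 2) instance centers detected in frame t-1
    curr_centers : (N, 2) instance centers detected in frame t
    pred_offsets : (N, 2) tracking-head output: estimated movement of each
                   current center since frame t-1 (hypothetical layout)
    max_dist     : matching gate in pixels (hypothetical value)

    Returns a list of length N holding the matched index into prev_centers,
    or -1 for unmatched (newly appearing) instances.
    """
    # Project each current detection back to its estimated t-1 position.
    projected = curr_centers - pred_offsets              # (N, 2)
    # Pairwise distances between projected and previous centers.
    dists = torch.cdist(projected, prev_centers)         # (N, M)
    matches = [-1] * curr_centers.shape[0]
    used = set()
    # Match the globally closest pairs first.
    for flat_idx in torch.argsort(dists.flatten()).tolist():
        i, j = divmod(flat_idx, prev_centers.shape[0])
        if matches[i] == -1 and j not in used and dists[i, j] <= max_dist:
            matches[i] = j
            used.add(j)
    return matches
```

Associating on projected center positions rather than appearance embeddings is what would make such a scheme robust to changes in object appearance, which is the property the abstract claims for the tracking head.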