Recent advances in Video Instance Segmentation (VIS) have largely been driven by deeper and increasingly data-hungry transformer-based models. However, video masks are tedious and expensive to annotate, limiting the scale and diversity of existing VIS datasets. In this work, we aim to remove the mask-annotation requirement. We propose MaskFreeVIS, achieving highly competitive VIS performance, while only using bounding box annotations for the object state. We leverage the rich temporal mask consistency constraints in videos by introducing the Temporal KNN-patch Loss (TK-Loss), providing strong mask supervision without any labels. Our TK-Loss finds one-to-many matches across frames, through an efficient patch-matching step followed by a K-nearest neighbor selection. A consistency loss is then enforced on the found matches. Our mask-free objective is simple to implement, has no trainable parameters, is computationally efficient, yet outperforms baselines employing, e.g., state-of-the-art optical flow to enforce temporal mask consistency. We validate MaskFreeVIS on the YouTube-VIS 2019/2021, OVIS and BDD100K MOTS benchmarks. The results clearly demonstrate the efficacy of our method by drastically narrowing the gap between fully and weakly-supervised VIS performance. Our code and trained models are available at https://github.com/SysCV/MaskFreeVis.
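To make the TK-Loss idea concrete, the sketch below illustrates one possible PyTorch implementation of a temporal KNN-patch consistency term as described in the abstract: per-pixel patch matching between adjacent frames within a local search window, K-nearest-neighbor selection of candidate matches, and a consistency loss on the predicted mask probabilities of matched locations. The function name, default hyperparameters (patch size, search radius, K, distance threshold), and the exact distance/consistency formulas are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def tk_loss_sketch(masks_t, masks_t1, imgs_t, imgs_t1,
                   patch_size=3, radius=4, k=5, tau=0.1):
    """Illustrative temporal KNN-patch consistency loss (not the official code).

    masks_t, masks_t1: (B, 1, H, W) predicted mask probabilities for frames t, t+1.
    imgs_t, imgs_t1:   (B, C, H, W) RGB frames used for unsupervised patch matching.
    """
    B, _, H, W = masks_t.shape
    pad = patch_size // 2

    # Extract a patch around every pixel: (B, C*p*p, H, W).
    patches_t = F.unfold(imgs_t, patch_size, padding=pad).view(B, -1, H, W)
    patches_t1 = F.unfold(imgs_t1, patch_size, padding=pad).view(B, -1, H, W)

    dists, cand_masks = [], []
    # Enumerate candidate displacements inside the (2R+1)^2 search window.
    # torch.roll wraps around the image border; a real implementation would
    # mask out these wrapped candidates.
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            shifted_p = torch.roll(patches_t1, shifts=(dy, dx), dims=(2, 3))
            shifted_m = torch.roll(masks_t1, shifts=(dy, dx), dims=(2, 3))
            # Mean absolute patch distance between frame-t and shifted frame-(t+1) pixels.
            dists.append((patches_t - shifted_p).abs().mean(dim=1))   # (B, H, W)
            cand_masks.append(shifted_m.squeeze(1))                   # (B, H, W)

    dists = torch.stack(dists, dim=1)        # (B, N, H, W), N = (2R+1)^2
    cand_masks = torch.stack(cand_masks, 1)  # (B, N, H, W)

    # One-to-many matching: keep the K best candidates per pixel and
    # discard matches whose patch distance exceeds the threshold tau.
    topk_d, idx = dists.topk(k, dim=1, largest=False)
    topk_m = cand_masks.gather(1, idx)
    valid = (topk_d < tau).float()

    # Consistency term: matched pixels should agree on foreground/background.
    agree = masks_t * topk_m + (1.0 - masks_t) * (1.0 - topk_m)       # broadcasts over K
    loss = -(torch.log(agree.clamp(min=1e-6)) * valid).sum() / valid.sum().clamp(min=1.0)
    return loss
```

In this sketch the matching is purely appearance-based and parameter-free, so the term can be added to any box-supervised VIS training loop without extra learnable components; the released code may differ in matching details and loss weighting.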