Video instance segmentation (VIS) aims to segment and associate all instances of predefined classes in each frame of a video. Prior methods usually obtain segmentation for a frame or clip first, and then merge the incomplete results by tracking or matching. These methods may accumulate errors in the merging step. In contrast, we propose a new paradigm, Propose-Reduce, which generates complete sequences for input videos in a single step. We further build a sequence propagation head on an existing image-level instance segmentation network for long-term propagation. To ensure robustness and high recall, our framework proposes multiple sequences and then reduces redundant sequences of the same instance. We achieve state-of-the-art performance on two representative benchmark datasets: 47.6% AP on the YouTube-VIS validation set and 70.4% J&F on the DAVIS-UVOS validation set. Code is available at https://github.com/dvlab-research/ProposeReduce.
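The "Reduce" step described above can be viewed as sequence-level non-maximum suppression over candidate mask sequences. The sketch below is a minimal illustration under assumed conventions (masks as sets of pixel coordinates, sequence overlap as the average per-frame mask IoU, an illustrative 0.5 threshold); the function names and details are hypothetical and not the paper's implementation.

```python
def mask_iou(a, b):
    """IoU of two binary masks, each represented as a set of pixel coordinates."""
    if not a and not b:
        return 1.0
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def sequence_iou(seq_a, seq_b):
    """Overlap of two instance sequences: average per-frame mask IoU."""
    assert len(seq_a) == len(seq_b), "sequences must cover the same frames"
    return sum(mask_iou(a, b) for a, b in zip(seq_a, seq_b)) / len(seq_a)


def reduce_sequences(sequences, scores, iou_thresh=0.5):
    """Greedy sequence-level NMS: keep each sequence only if it does not
    heavily overlap an already-kept, higher-scoring sequence.
    Returns indices of the kept sequences, highest score first."""
    order = sorted(range(len(sequences)), key=lambda i: -scores[i])
    kept = []
    for i in order:
        if all(sequence_iou(sequences[i], sequences[j]) < iou_thresh
               for j in kept):
            kept.append(i)
    return kept
```

For example, two candidate sequences of the same instance (high mutual IoU) collapse to the single higher-scoring one, while a sequence of a different instance (low IoU with the rest) survives the reduction.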