Video instance segmentation (VIS) aims to segment and associate all instances of predefined classes in each frame of a video. Prior methods usually obtain segmentation for a frame or clip first, then merge the incomplete results by tracking or matching, which can accumulate errors in the merging step. In contrast, we propose a new paradigm, Propose-Reduce, which generates complete sequences for input videos in a single step. We further build a sequence propagation head on top of an existing image-level instance segmentation network for long-term propagation. To ensure robustness and high recall, the framework proposes multiple sequences and then reduces redundant sequences of the same instance. We achieve state-of-the-art performance on two representative benchmarks: 47.6% AP on the YouTube-VIS validation set and 70.4% J&F on the DAVIS-UVOS validation set.
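The "reduce" step described in the abstract, where redundant sequences of the same instance are removed, can be illustrated with a greedy, NMS-style filter over whole mask sequences. The sketch below is only an assumption about how such a step might look; the abstract does not specify the procedure. The names `sequence_iou` and `reduce_sequences`, the set-of-voxels mask representation, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
from typing import List, Set, Tuple

# Hypothetical representation: a sequence is the set of foreground
# voxels (frame, y, x) covered by one instance across the whole video.
Voxel = Tuple[int, int, int]


def sequence_iou(a: Set[Voxel], b: Set[Voxel]) -> float:
    """Spatio-temporal IoU between two mask sequences."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def reduce_sequences(
    sequences: List[Set[Voxel]],
    scores: List[float],
    iou_threshold: float = 0.5,  # assumed value, not taken from the paper
) -> List[int]:
    """Greedy suppression: keep the highest-scoring sequence, then drop
    any later sequence that overlaps a kept one at or above the threshold.
    Returns the indices of the kept sequences in descending score order."""
    order = sorted(range(len(sequences)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        if all(sequence_iou(sequences[i], sequences[k]) < iou_threshold for k in kept):
            kept.append(i)
    return kept


# Two proposals cover nearly the same instance; a third covers another one.
seq_a = {(0, 0, 0), (0, 0, 1), (1, 0, 0), (1, 0, 1)}
seq_b = {(0, 0, 0), (0, 0, 1), (1, 0, 0)}
seq_c = {(0, 5, 5), (1, 5, 5)}
kept_indices = reduce_sequences([seq_a, seq_b, seq_c], [0.9, 0.8, 0.7])
```

In this toy example `seq_b` overlaps `seq_a` with an IoU of 0.75, so only the higher-scoring `seq_a` survives alongside the disjoint `seq_c`. Proposing several sequences and filtering them afterwards trades extra computation for recall, which matches the motivation given in the abstract.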