Temporally consistent dense video annotations are scarce and hard to collect. In contrast, image segmentation datasets (and pre-trained models) are ubiquitous, and easier to label for any novel task. In this paper, we introduce a method to adapt still image segmentation models to video in an unsupervised manner, using an optical flow-based consistency measure. To ensure that the inferred segmented videos appear more stable in practice, we verify via a user study that the consistency measure is well correlated with human judgement. By training a new multi-input multi-output decoder with this measure as a loss, together with a technique for refining current image segmentation datasets and a temporal weighted guided filter, we observe stability improvements in the generated segmented videos with minimal loss of accuracy.
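A minimal sketch of one way such a flow-based temporal consistency measure could be computed is shown below, assuming per-frame softmax predictions and a precomputed optical flow field; the function names and the PyTorch formulation are illustrative assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F


def warp_with_flow(seg_probs, flow):
    """Backward-warp per-pixel class probabilities from frame t to frame t+1.

    seg_probs: (B, C, H, W) softmax probabilities predicted for frame t.
    flow:      (B, 2, H, W) optical flow mapping each frame t+1 pixel back to
               its source location in frame t (in pixels, x then y).
    """
    b, _, h, w = seg_probs.shape
    # Base sampling grid of pixel coordinates for frame t+1.
    ys, xs = torch.meshgrid(
        torch.arange(h, device=flow.device, dtype=flow.dtype),
        torch.arange(w, device=flow.device, dtype=flow.dtype),
        indexing="ij",
    )
    # Displace each target pixel by the flow to find where it came from in frame t.
    src_x = xs.unsqueeze(0) + flow[:, 0]
    src_y = ys.unsqueeze(0) + flow[:, 1]
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    grid = torch.stack(
        (2.0 * src_x / (w - 1) - 1.0, 2.0 * src_y / (h - 1) - 1.0), dim=-1
    )
    return F.grid_sample(seg_probs, grid, mode="bilinear", align_corners=True)


def temporal_consistency_loss(probs_t, probs_t1, flow_t1_to_t):
    """Penalize disagreement between the flow-warped prediction for frame t
    and the prediction for frame t+1; no video ground truth is required."""
    warped = warp_with_flow(probs_t, flow_t1_to_t)
    return (warped - probs_t1).abs().mean()
```

In practice such a measure is usually combined with an occlusion or forward-backward flow check so that pixels with unreliable flow do not contribute to the loss; the sketch above omits that masking for brevity.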