Recently, removing objects from videos and filling in the erased regions using deep video inpainting (VI) algorithms has attracted considerable attention. Typically, a video sequence and object segmentation masks for all frames are required as input for this task. However, in real-world applications, providing segmentation masks for every frame is difficult and inefficient. Therefore, we address VI in a one-shot manner, taking only the initial frame's object mask as input. Although this can be achieved with a naive combination of video object segmentation (VOS) and VI methods, such combinations are sub-optimal and often cause critical errors. To address this, we propose a unified pipeline for one-shot video inpainting (OSVI). By jointly learning mask prediction and video completion in an end-to-end manner, the results are optimized for the entire task rather than for each separate module. Additionally, unlike two-stage methods that treat the predicted masks as ground-truth cues, our method is more reliable because the predicted masks serve as the network's internal guidance. On synthesized datasets for OSVI, our proposed method outperforms all others both quantitatively and qualitatively.