Video inpainting aims to fill spatio-temporal "corrupted" regions with plausible content. To achieve this, a method must find correspondences in neighbouring frames so that it can faithfully hallucinate the unknown content. Current methods do so through attention, flow-based warping, or 3D temporal convolution. However, flow-based warping can create artifacts when the optical flow is inaccurate, while temporal convolution may suffer from spatial misalignment. We propose the Progressive Temporal Feature Alignment Network, which progressively enriches features extracted from the current frame with features warped from neighbouring frames using optical flow. Our approach corrects spatial misalignment during temporal feature propagation, greatly improving the visual quality and temporal consistency of the inpainted videos. With the proposed architecture, we achieve state-of-the-art performance on the DAVIS and FVI datasets compared to existing deep learning approaches. Code is available at https://github.com/MaureenZOU/TSAM.
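To make the core operation concrete, below is a minimal sketch of flow-based feature warping, the step the abstract relies on to align a neighbouring frame's features with the current frame before they are fused. This is an illustrative PyTorch implementation under common conventions (flow given in pixels, bilinear sampling); it is not the authors' code, which is available at the linked repository.

```python
import torch
import torch.nn.functional as F


def warp_features(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp a neighbouring frame's feature map toward the current frame.

    feat: (N, C, H, W) feature map from a neighbouring frame.
    flow: (N, 2, H, W) optical flow in pixels, mapping current-frame
          locations to the corresponding neighbouring-frame locations.
    """
    n, _, h, w = feat.shape
    # Base sampling grid: pixel coordinates of the current frame.
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=feat.dtype, device=feat.device),
        torch.arange(w, dtype=feat.dtype, device=feat.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0)  # (1, 2, H, W)
    # Displace the grid by the flow, then normalize to [-1, 1],
    # the coordinate range expected by grid_sample.
    coords = grid + flow
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    norm_grid = torch.stack((coords_x, coords_y), dim=-1)  # (N, H, W, 2)
    # Bilinearly sample the neighbour's features at the flowed locations.
    return F.grid_sample(feat, norm_grid, mode="bilinear",
                         padding_mode="zeros", align_corners=True)
```

In the progressive scheme the abstract describes, warped neighbour features of this kind would then be fused with the current frame's features (rather than used directly), so that spatial misalignment is corrected before temporal propagation continues.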