Transformers have been widely used for video processing owing to the multi-head self-attention (MHSA) mechanism. However, the MHSA mechanism encounters an intrinsic difficulty for video inpainting, since the features associated with the corrupted regions are degraded and incur inaccurate self-attention. This problem, termed query degradation, may be mitigated by first completing optical flows and then using the flows to guide the self-attention, which was verified in our previous work, the flow-guided transformer (FGT). We further exploit the flow guidance and propose FGT++ to pursue more effective and efficient video inpainting. First, we design a lightweight flow completion network using local aggregation and an edge loss. Second, to address the query degradation, we propose a flow guidance feature integration module, which uses the motion discrepancy to enhance the features, together with a flow-guided feature propagation module that warps the features according to the flows. Third, we decouple the transformer along the temporal and spatial dimensions, where flows are used to select the tokens through a temporally deformable MHSA mechanism, and global tokens are combined with the inner-window local tokens through a dual-perspective MHSA mechanism. Experiments show that FGT++ outperforms existing video inpainting networks both qualitatively and quantitatively.
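The flow-guided feature propagation step above warps features from neighboring frames into the target frame according to the completed flows. A minimal sketch of such backward warping with bilinear sampling is shown below; the function name `warp_features` and the NumPy implementation are illustrative assumptions, and the paper's actual module additionally handles learned weights and validity masks, which are omitted here.

```python
import numpy as np

def warp_features(feat, flow):
    """Backward-warp a feature map by an optical flow field (sketch).

    feat: (H, W, C) features from a neighboring frame.
    flow: (H, W, 2) flow from the target frame to that neighbor,
          as (dx, dy) per pixel.
    Samples bilinearly at (x + dx, y + dy); coordinates outside the
    image are clamped to the border. This is a simplified stand-in
    for the paper's flow-guided feature propagation module.
    """
    H, W, C = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    x = np.clip(xs + flow[..., 0], 0, W - 1)
    y = np.clip(ys + flow[..., 1], 0, H - 1)
    # Integer corners for bilinear interpolation.
    x0 = np.floor(x).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    y0 = np.floor(y).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    wx = (x - x0)[..., None]
    wy = (y - y0)[..., None]
    top = feat[y0, x0] * (1 - wx) + feat[y0, x1] * wx
    bot = feat[y1, x0] * (1 - wx) + feat[y1, x1] * wx
    return top * (1 - wy) + bot * wy
```

For example, a constant flow of (1, 0) shifts the feature map one pixel leftward in the target frame's coordinates, while a zero flow returns the features unchanged.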