Video inpainting aims to fill given spatiotemporal holes with realistic appearance, but it remains a challenging task even with the rapid progress of deep learning approaches. Recent works introduce the promising Transformer architecture into deep video inpainting and achieve better performance. However, these methods still suffer from synthesizing blurry textures and from high computational cost. To this end, we propose a novel Decoupled Spatial-Temporal Transformer (DSTT) for improving video inpainting with exceptional efficiency. DSTT disentangles the task of learning spatial-temporal attention into two sub-tasks: one attends to temporal object movements across different frames at the same spatial locations, which is achieved by a temporally-decoupled Transformer block; the other attends to similar background textures across all spatial positions of the same frame, which is achieved by a spatially-decoupled Transformer block. Stacking these two blocks in an interleaved fashion lets the model attend to background textures and moving objects more precisely, so that the attended plausible and temporally-coherent appearance can be propagated to fill the holes. In addition, a hierarchical encoder is adopted before the stack of Transformer blocks to learn robust, hierarchical features that maintain multi-level local spatial structure, resulting in more representative token vectors. The seamless combination of these two novel designs forms a better spatial-temporal attention scheme, and the proposed model achieves better performance than state-of-the-art video inpainting approaches with significantly improved efficiency.
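The two attention sub-tasks described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names (`attend`, `temporal_attention`, `spatial_attention`, `decoupled_stack`, `attention_pairs`) are hypothetical, attention is single-head with no learned query/key/value projections, and the hierarchical encoder, feed-forward layers, and normalization are omitted. It only shows how tokens might be grouped in each block, and how that grouping reduces the number of query-key pairs compared with attending over all tokens of all frames at once.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(tokens):
    """Scaled dot-product self-attention within one group of token vectors.

    Assumption: queries, keys, and values are the tokens themselves
    (no learned projections), which keeps the sketch self-contained.
    """
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)])
    return out


def temporal_attention(video):
    """video[t][n] is the token of frame t at spatial position n.

    Each spatial position attends only across frames.
    """
    T, N = len(video), len(video[0])
    result = [[None] * N for _ in range(T)]
    for n in range(N):
        group = [video[t][n] for t in range(T)]
        for t, vec in enumerate(attend(group)):
            result[t][n] = vec
    return result


def spatial_attention(video):
    """Each frame attends only across its own spatial positions."""
    return [attend(frame) for frame in video]


def decoupled_stack(video, depth=2):
    """Interleave temporal and spatial blocks `depth` times (depth is illustrative)."""
    for _ in range(depth):
        video = temporal_attention(video)
        video = spatial_attention(video)
    return video


def attention_pairs(T, N):
    """Query-key pairs for one joint block vs. one temporal plus one spatial block."""
    joint = (T * N) ** 2
    decoupled = N * T * T + T * N * N
    return joint, decoupled
```

Under these assumptions, a clip of 5 frames with 100 tokens per frame needs `(5 * 100) ** 2 = 250000` query-key pairs in a joint block, while one temporal block plus one spatial block needs `100 * 25 + 5 * 10000 = 52500`. The real model's cost also depends on details the abstract does not give, such as the number of heads and how spatial locations are grouped.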