Human motion transfer refers to synthesizing photo-realistic and temporally coherent videos in which one person imitates the motion of another. However, current synthetic videos suffer from temporal inconsistency across sequential frames, which significantly degrades video quality and remains far from solved by existing methods operating in the pixel domain. Recently, some works on DeepFake detection have attempted to distinguish natural from synthetic images in the frequency domain, exploiting the frequency insufficiency of image synthesis methods. Nonetheless, no prior work has studied the temporal inconsistency of synthetic videos from the perspective of the frequency-domain gap between natural and synthetic videos. In this paper, we propose to delve into the frequency space for temporally consistent human motion transfer. First, we make the first comprehensive analysis of natural and synthetic videos in the frequency domain, revealing a frequency gap in both the spatial dimension of individual frames and the temporal dimension of the video. To close this gap, we propose a novel Frequency-based human MOtion TRansfer framework, named FreMOTR, which can effectively mitigate the spatial artifacts and the temporal inconsistency of synthesized videos. FreMOTR introduces two novel frequency-based regularization modules: 1) the Frequency-domain Appearance Regularization (FAR), which improves the appearance of the person in individual frames, and 2) the Temporal Frequency Regularization (TFR), which enforces temporal consistency between adjacent frames. Finally, comprehensive experiments demonstrate that FreMOTR not only yields superior performance on temporal consistency metrics but also improves the frame-level visual quality of synthetic videos. In particular, the temporal consistency metrics are improved by nearly 30% over the state-of-the-art model.
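The abstract describes measuring a frequency gap between natural and synthetic videos in both the spatial dimension (per-frame 2D spectra) and the temporal dimension (per-pixel spectra along time). The paper's exact formulation is not given in the abstract; the sketch below is only an illustrative NumPy implementation of that general idea, with all function names (`spatial_spectrum`, `temporal_spectrum`, `frequency_gap`) invented here, not taken from FreMOTR.

```python
import numpy as np

def spatial_spectrum(frame):
    """Log-magnitude 2D FFT spectrum of a single grayscale frame (H, W)."""
    f = np.fft.fftshift(np.fft.fft2(frame))
    return np.log1p(np.abs(f))

def temporal_spectrum(video):
    """Log-magnitude 1D FFT along the time axis of a (T, H, W) clip,
    capturing how each pixel's intensity varies over time."""
    f = np.fft.fft(video, axis=0)
    return np.log1p(np.abs(f))

def frequency_gap(natural, synthetic):
    """Mean absolute spectral difference between two (T, H, W) clips,
    reported separately for the spatial and temporal dimensions."""
    spatial = np.mean([np.abs(spatial_spectrum(a) - spatial_spectrum(b))
                       for a, b in zip(natural, synthetic)])
    temporal = np.abs(temporal_spectrum(natural)
                      - temporal_spectrum(synthetic)).mean()
    return spatial, temporal

# Toy example: a smoothly varying "natural" clip vs. one corrupted by
# frame-wise noise (a crude stand-in for temporal flicker).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 16)[:, None, None]
base = rng.random((1, 32, 32))
natural = base + 0.1 * t                                   # slow drift over time
synthetic = natural + 0.2 * rng.random((16, 32, 32))       # per-frame noise
spat_gap, temp_gap = frequency_gap(natural, synthetic)
print(spat_gap, temp_gap)
```

Frame-wise noise spreads energy into high temporal frequencies, so the temporal gap is nonzero here; a regularizer in the spirit of TFR would penalize such spectral differences during training rather than measure them post hoc.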