One compelling application of artificial intelligence is generating a video of a target person performing arbitrary desired motion (driven by a source person). While state-of-the-art methods can synthesize videos that reproduce the broad strokes of the motion, they generally lack texture detail. A common manifestation is distorted faces, feet, and hands, flaws to which human observers are highly sensitive. Furthermore, current methods typically employ GANs with an L2 loss to assess the authenticity of the generated videos, which inherently requires a large number of training samples to learn texture details well enough for adequate video generation. In this work, we tackle these challenges from three aspects: 1) We disentangle each video frame into foreground (the person) and background, and focus on generating the foreground to reduce the underlying dimension of the network output. 2) We propose a theoretically motivated Gromov-Wasserstein loss that facilitates learning the mapping from a pose to a foreground image. 3) To enhance texture details, we encode facial features with geometric guidance and employ local GANs to refine the face, feet, and hands. Extensive experiments show that our method generates realistic target-person videos that faithfully copy complex motions from a source person. Our code and datasets are released at https://github.com/Sifann/FakeMotion
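As background for point 2), the Gromov-Wasserstein discrepancy compares two metric spaces by matching their internal pairwise-distance structures rather than the points themselves, which is why it suits mapping between heterogeneous domains such as poses and images. The sketch below is not the paper's loss; it is a minimal NumPy illustration of the generic GW objective, evaluated for a fixed candidate coupling `T` (the function name `gw_cost` and the toy point clouds are ours for illustration):

```python
import numpy as np

def gw_cost(C1, C2, T):
    """Gromov-Wasserstein objective with squared loss:
    sum_{i,j,k,l} (C1[i,k] - C2[j,l])**2 * T[i,j] * T[k,l],
    vectorized by expanding the square (C2 assumed symmetric)."""
    p = T.sum(axis=1)          # source marginal of the coupling
    q = T.sum(axis=0)          # target marginal of the coupling
    term1 = (C1**2 @ p) @ p    # sum C1[i,k]^2 p_i p_k
    term2 = (C2**2 @ q) @ q    # sum C2[j,l]^2 q_j q_l
    term3 = 2.0 * np.sum((C1 @ T @ C2) * T)  # cross term
    return term1 + term2 - term3

# Toy check: a rotated-and-translated copy of a point cloud preserves
# all pairwise distances, so the GW cost under the identity coupling is ~0.
pts = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
pts2 = pts @ R.T + np.array([3., -2.])

C1 = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
C2 = np.linalg.norm(pts2[:, None] - pts2[None, :], axis=-1)
T = np.eye(4) / 4.0            # identity coupling with uniform mass

print(gw_cost(C1, C2, T))      # ~0: the two clouds are isometric
print(gw_cost(C1, 2 * C1, T))  # >0: scaling distorts pairwise distances
```

In practice the coupling `T` is itself optimized (e.g. with entropic solvers as in the POT library); here it is fixed only to show that the objective depends purely on intra-domain distance matrices, never on cross-domain distances.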