Talking face generation aims at generating photo-realistic video portraits of a target person driven by input audio. Due to the one-to-many nature of the mapping from input audio to output video (e.g., one speech content may have multiple feasible visual appearances), learning a deterministic mapping, as in previous works, introduces ambiguity during training and thus causes inferior visual results. Although this one-to-many mapping can be partially alleviated by a two-stage framework (i.e., an audio-to-expression model followed by a neural-rendering model), it is still insufficient, since the prediction is produced without enough information (e.g., emotions, wrinkles). In this paper, we propose MemFace to complement the missing information with an implicit memory and an explicit memory that correspond to the roles of the two stages respectively. More specifically, the implicit memory is employed in the audio-to-expression model to capture high-level semantics in the shared audio-expression space, while the explicit memory is employed in the neural-rendering model to help synthesize pixel-level details. Our experimental results show that the proposed MemFace surpasses state-of-the-art results across multiple scenarios consistently and significantly.
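To make the first stage more concrete, the following is a minimal sketch of how an implicit memory might be attached to an audio-to-expression model, assuming a PyTorch-style implementation. The module names (ImplicitMemory, AudioToExpression), dimensions, and the attention-based memory read are illustrative assumptions based on the description above, not the paper's actual code; the explicit memory for the neural-rendering stage would be queried analogously with vertex or image features.

```python
# Hypothetical sketch of the first (audio-to-expression) stage with an
# implicit memory, following the abstract's description. All names,
# dimensions, and the attention-based read-out are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ImplicitMemory(nn.Module):
    """Learnable key/value slots queried by audio features to inject
    high-level semantics into the shared audio-expression space."""

    def __init__(self, num_slots: int = 64, dim: int = 256):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_slots, dim))
        self.values = nn.Parameter(torch.randn(num_slots, dim))

    def forward(self, query: torch.Tensor) -> torch.Tensor:
        # query: (batch, frames, dim); soft-attend over the memory slots.
        attn = F.softmax(query @ self.keys.t() / self.keys.size(-1) ** 0.5, dim=-1)
        return query + attn @ self.values  # residual read-out


class AudioToExpression(nn.Module):
    """Stage 1: map per-frame audio features to expression coefficients."""

    def __init__(self, audio_dim: int = 80, dim: int = 256, expr_dim: int = 64):
        super().__init__()
        self.encode = nn.Linear(audio_dim, dim)
        self.memory = ImplicitMemory(dim=dim)
        self.decode = nn.Linear(dim, expr_dim)

    def forward(self, audio_feat: torch.Tensor) -> torch.Tensor:
        h = self.encode(audio_feat)
        h = self.memory(h)  # complement information missing from audio alone
        return self.decode(h)


if __name__ == "__main__":
    audio_feat = torch.randn(2, 100, 80)  # (batch, frames, mel bins)
    expressions = AudioToExpression()(audio_feat)
    print(expressions.shape)  # torch.Size([2, 100, 64])
```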