We present a method that generates expressive talking heads from a single facial image with audio as the only input. In contrast to previous approaches that attempt to learn direct mappings from audio to raw pixels or points for creating talking faces, our method first disentangles the content and speaker information in the input audio signal. The audio content robustly controls the motion of the lips and nearby facial regions, while the speaker information determines the specifics of facial expressions and the rest of the talking-head dynamics. Another key component of our method is the prediction of facial landmarks reflecting speaker-aware dynamics. Based on this intermediate representation, our method is able to synthesize photorealistic videos of entire talking heads with a full range of motion, and can also animate artistic paintings, sketches, 2D cartoon characters, Japanese mangas, and stylized caricatures within a single unified framework. We present extensive quantitative and qualitative evaluations of our method, in addition to user studies, demonstrating generated talking heads of significantly higher quality compared to the prior state of the art.
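To make the described pipeline concrete, below is a minimal sketch of the landmark-prediction stage it outlines: audio is split into a per-frame content embedding and an utterance-level speaker embedding, which together drive displacements of facial landmarks on the input portrait. All module names, dimensions, and layer choices here are illustrative assumptions, not the authors' actual architecture, and the final landmark-to-image synthesis stage is omitted.

```python
# Hypothetical sketch of the audio-to-landmark stage; every design choice is an assumption.
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Maps per-frame audio features to a content embedding (drives lip motion)."""
    def __init__(self, audio_dim=80, content_dim=64):
        super().__init__()
        self.rnn = nn.GRU(audio_dim, content_dim, batch_first=True)

    def forward(self, audio):                  # audio: (B, T, audio_dim)
        out, _ = self.rnn(audio)
        return out                             # (B, T, content_dim)

class SpeakerEncoder(nn.Module):
    """Summarizes the whole utterance into a speaker-identity embedding."""
    def __init__(self, audio_dim=80, speaker_dim=32):
        super().__init__()
        self.rnn = nn.GRU(audio_dim, speaker_dim, batch_first=True)

    def forward(self, audio):                  # audio: (B, T, audio_dim)
        _, h = self.rnn(audio)
        return h[-1]                           # (B, speaker_dim)

class LandmarkPredictor(nn.Module):
    """Predicts per-frame 2D displacements for 68 facial landmarks."""
    def __init__(self, content_dim=64, speaker_dim=32, n_landmarks=68):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(content_dim + speaker_dim, 256), nn.ReLU(),
            nn.Linear(256, n_landmarks * 2))

    def forward(self, content, speaker):       # content: (B, T, C), speaker: (B, S)
        spk = speaker.unsqueeze(1).expand(-1, content.shape[1], -1)
        delta = self.mlp(torch.cat([content, spk], dim=-1))
        return delta.view(content.shape[0], content.shape[1], -1, 2)  # (B, T, 68, 2)

# Toy usage: static landmarks detected on the input image plus predicted motion.
audio = torch.randn(1, 100, 80)                # e.g. 100 frames of mel-spectrogram features
static_landmarks = torch.zeros(1, 68, 2)       # landmarks from the single input portrait
content = ContentEncoder()(audio)
speaker = SpeakerEncoder()(audio)
animated = static_landmarks.unsqueeze(1) + LandmarkPredictor()(content, speaker)
print(animated.shape)                          # torch.Size([1, 100, 68, 2])
```

The animated landmark sequence serves as the intermediate representation mentioned in the abstract; in this sketch it would then be passed to a separate renderer (photorealistic image-to-image translation for photographs, or direct warping for cartoons and sketches).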