Talking face generation has been extensively investigated owing to its wide applicability. Two primary frameworks are used: a text-driven framework, which generates synchronized speech and talking faces from text, and a speech-driven framework, which generates talking faces from speech. To integrate these frameworks, this paper proposes a unified facial landmark generator (UniFLG). The proposed system exploits end-to-end text-to-speech not only for synthesizing speech but also for extracting a series of latent representations that are common to text and speech, and feeds them to a landmark decoder to generate facial landmarks. We demonstrate that our system achieves higher naturalness in both speech synthesis and facial landmark generation than the state-of-the-art text-driven method. We further demonstrate that our system can generate facial landmarks from the speech of speakers without facial video data, or even without speech data.
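To illustrate the core idea of decoding a latent sequence shared by text and speech into facial landmarks, the following is a minimal PyTorch sketch. It is not the paper's architecture: the latent dimension, the 68-point landmark layout, and the MLP decoder are illustrative assumptions, and a random tensor stands in for the latents produced by the end-to-end TTS model.

```python
import torch
import torch.nn as nn


class LandmarkDecoder(nn.Module):
    """Toy landmark decoder: maps frame-level latent representations
    (shared by the text- and speech-driven paths) to 2D facial landmarks."""

    def __init__(self, latent_dim: int = 256, n_landmarks: int = 68):
        super().__init__()
        self.n_landmarks = n_landmarks
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 512),
            nn.ReLU(),
            nn.Linear(512, n_landmarks * 2),
        )

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        # latents: (batch, frames, latent_dim) -> (batch, frames, n_landmarks, 2)
        batch, frames, _ = latents.shape
        return self.net(latents).view(batch, frames, self.n_landmarks, 2)


# In the text-driven path these latents would come from the end-to-end TTS
# model; in the speech-driven path, from an encoder applied to speech.
latents = torch.randn(1, 120, 256)        # 1 utterance, 120 frames (dummy data)
landmarks = LandmarkDecoder()(latents)    # shape: (1, 120, 68, 2)
print(landmarks.shape)
```

Because the decoder consumes only the latent sequence, the same module can serve both input paths, which is the unification the abstract describes.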