To the best of our knowledge, we present the first live system that generates personalized photorealistic talking-head animation driven solely by audio signals at over 30 fps. Our system contains three stages. The first stage is a deep neural network that extracts deep audio features, followed by a manifold projection that maps these features into the target person's speech space. In the second stage, we learn facial dynamics and motions from the projected audio features. The predicted motions include head poses and upper-body motions: head poses are generated by an autoregressive probabilistic model that captures the head pose distribution of the target person, and upper-body motions are deduced from the head poses. In the final stage, we generate conditional feature maps from the previous predictions and feed them, together with a candidate image set, into an image-to-image translation network to synthesize photorealistic renderings. Our method generalizes well to in-the-wild audio and successfully synthesizes high-fidelity personalized facial details, e.g., wrinkles and teeth. Our method also allows explicit control of head poses. Extensive qualitative and quantitative evaluations, along with user studies, demonstrate the superiority of our method over state-of-the-art techniques.
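To make the three-stage pipeline concrete, the following is a minimal sketch assuming a PyTorch-style interface. The module sizes, the k-nearest-neighbour form of the manifold projection, and the Gaussian autoregressive pose head are illustrative assumptions for exposition, not the paper's actual architecture.

```python
# Hypothetical sketch of the three-stage pipeline (assumptions, not the authors' code).
import torch
import torch.nn as nn


class AudioEncoder(nn.Module):
    """Stage 1a: extract deep audio features from per-frame audio descriptors."""
    def __init__(self, in_dim=80, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU(),
                                 nn.Linear(feat_dim, feat_dim))

    def forward(self, audio):                    # audio: (B, T, in_dim)
        return self.net(audio)                   # -> (B, T, feat_dim)


def manifold_projection(feat, target_bank, k=10):
    """Stage 1b (assumed k-NN variant): project features onto the target
    person's speech space by reconstructing each frame from its k nearest
    neighbours in a bank of the target person's own features."""
    bank = target_bank.unsqueeze(0).expand(feat.size(0), -1, -1)  # (B, N, D)
    dist = torch.cdist(feat, bank)                                # (B, T, N)
    knn = dist.topk(k, largest=False)
    weights = torch.softmax(-knn.values, dim=-1)                  # closer -> heavier
    neighbours = target_bank[knn.indices]                         # (B, T, k, D)
    return (weights.unsqueeze(-1) * neighbours).sum(dim=-2)       # (B, T, D)


class HeadPoseAR(nn.Module):
    """Stage 2: autoregressive probabilistic head-pose model; predicts a
    Gaussian over the next pose given past poses and projected audio features."""
    def __init__(self, feat_dim=256, pose_dim=6, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim + pose_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2 * pose_dim)               # mean, log-variance

    def forward(self, feat, prev_pose, state=None):
        h, state = self.rnn(torch.cat([feat, prev_pose], dim=-1), state)
        mu, logvar = self.head(h).chunk(2, dim=-1)
        pose = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # sample a pose
        return pose, state                                        # upper body derived from pose downstream


class Renderer(nn.Module):
    """Stage 3: image-to-image translation from conditional feature maps plus a
    candidate image set to a photorealistic frame (toy convnet stand-in)."""
    def __init__(self, cond_ch=3, cand_ch=12):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(cond_ch + cand_ch, 64, 3, padding=1),
                                 nn.ReLU(),
                                 nn.Conv2d(64, 3, 3, padding=1), nn.Tanh())

    def forward(self, cond_map, candidates):      # (B, cond_ch, H, W), (B, cand_ch, H, W)
        return self.net(torch.cat([cond_map, candidates], dim=1))
```

Sampling the head pose from a learned distribution, rather than regressing a single value, is what allows both natural variation and explicit user control: the sampled pose can simply be replaced by a user-specified one before rendering.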