Audio-driven talking face generation aims to synthesize talking faces whose facial animations (accurate lip movements, vivid facial expression details, and natural head poses) match the driving audio, and it has progressed rapidly in recent years. However, most existing work generates lip movements only, without handling the closely correlated facial expressions, which greatly degrades the realism of the generated faces. This paper presents DIRFA, a novel method that can generate talking faces with diverse yet realistic facial animations from the same driving audio. To accommodate the natural variation of plausible facial animations for the same audio, we design a transformer-based probabilistic mapping network that models the distribution of facial animations conditioned on the input audio and autoregressively converts the audio signals into a facial animation sequence. In addition, we introduce a temporally-biased mask into the mapping network, which enables modeling the temporal dependency of facial animations and producing temporally smooth facial animation sequences. Given the generated facial animation sequence and a source image, photo-realistic talking faces can be synthesized with a generic generation network. Extensive experiments show that DIRFA generates talking faces with realistic facial animations effectively.
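The abstract does not spell out how the temporally-biased mask is constructed. As a rough illustration only, the sketch below shows one plausible realization in Python/PyTorch, assuming a causal attention mask whose past entries carry an additive penalty growing linearly with temporal distance (in the spirit of recency-biased attention); the function name `temporally_biased_mask` and the `bias_scale` parameter are illustrative assumptions, not the paper's definitions.

```python
import torch

def temporally_biased_mask(seq_len: int, bias_scale: float = 0.1) -> torch.Tensor:
    """Additive attention mask: causal, with a distance-proportional penalty.

    Entry (i, j) is -inf for future keys (j > i), keeping decoding
    autoregressive, and -bias_scale * (i - j) for past keys, so frame i
    attends preferentially to its recent history -- one way to encourage
    temporally smooth animation sequences (an assumed construction).
    """
    idx = torch.arange(seq_len)
    dist = idx[:, None] - idx[None, :]                 # (T, T), entry = i - j
    mask = -bias_scale * dist.clamp(min=0).float()     # linear recency bias on the past
    return mask.masked_fill(dist < 0, float("-inf"))   # block attention to future frames

# Usage: plug the mask into standard scaled dot-product attention.
T, d = 6, 16
q = k = v = torch.randn(1, 1, T, d)                    # (batch, heads, T, dim)
out = torch.nn.functional.scaled_dot_product_attention(
    q, k, v, attn_mask=temporally_biased_mask(T))
print(out.shape)                                       # torch.Size([1, 1, 6, 16])
```

Since the mask is purely additive, the same tensor can also be passed as `tgt_mask` to a standard transformer decoder layer, which is consistent with the autoregressive mapping network described above.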