We propose an audio-driven method that generates photo-realistic talking-head videos from a single reference image. In this work, we tackle two key challenges: (i) producing natural head motions that match speech prosody, and (ii) preserving the speaker's appearance under large head motions while keeping non-face regions stable. We first design a head pose predictor that models rigid 6D head movements with a motion-aware recurrent neural network (RNN). The predicted head poses thus act as the low-frequency holistic movements of a talking head, allowing the subsequent network to focus on generating detailed facial movements. To capture the full-image motion induced by audio, we adopt a keypoint-based dense motion field representation. We then develop a motion field generator that produces dense motion fields from the input audio, the predicted head poses, and the reference image. Because this keypoint-based representation jointly models the motions of the facial regions, the head, and the background, our method better preserves the spatial and temporal consistency of the generated videos. Finally, an image generation network renders photo-realistic talking-head videos from the estimated keypoint-based motion fields and the input reference image. Extensive experiments demonstrate that our method produces videos with plausible head motions, synchronized facial expressions, and stable backgrounds, and that it outperforms the state of the art.
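To make the head pose stage concrete, below is a minimal sketch of an audio-to-pose predictor. It assumes MFCC-like audio features aligned to the video frame rate, a GRU as the recurrent backbone, and a linear head regressing a rigid 6D pose per frame; the layer sizes, the choice of GRU, and the pose parameterization (three rotation angles plus three translations) are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of an audio-driven head pose predictor (assumptions: GRU cell,
# MFCC-like audio features, 6D pose = 3 rotation angles + 3 translations).
import torch
import torch.nn as nn

class HeadPosePredictor(nn.Module):
    def __init__(self, audio_dim=80, hidden_dim=256):
        super().__init__()
        # Recurrent backbone: captures the low-frequency, holistic head motion.
        self.rnn = nn.GRU(audio_dim, hidden_dim, num_layers=2, batch_first=True)
        # Regress a rigid 6D pose per frame: (yaw, pitch, roll, tx, ty, tz).
        self.pose_head = nn.Linear(hidden_dim, 6)

    def forward(self, audio_feats):
        # audio_feats: (batch, frames, audio_dim), aligned to the frame rate.
        h, _ = self.rnn(audio_feats)
        return self.pose_head(h)  # (batch, frames, 6)

# Usage: predict a pose sequence for 100 frames of 80-dim audio features.
poses = HeadPosePredictor()(torch.randn(2, 100, 80))
print(poses.shape)  # torch.Size([2, 100, 6])
```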
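The keypoint-based dense motion field can likewise be sketched in a simplified form. The snippet below assumes a Gaussian-weighted per-keypoint translation field, in the spirit of first-order motion models, followed by backward warping of the reference image; the keypoint layout, the softmax weighting, and the `sigma` bandwidth are assumptions for illustration, not the paper's actual motion field generator.

```python
# Minimal sketch of keypoint-driven dense motion and image warping.
import torch
import torch.nn.functional as F

def dense_motion(kp_src, kp_drv, h, w, sigma=0.1):
    # kp_src, kp_drv: (batch, num_kp, 2) keypoints in [-1, 1] image coordinates.
    ys = torch.linspace(-1, 1, h)
    xs = torch.linspace(-1, 1, w)
    grid = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1)
    grid = grid.flip(-1)  # (h, w, 2) reordered to (x, y) to match keypoints
    # Squared distance of every pixel to each driving keypoint -> soft weights.
    d2 = ((grid[None, None] - kp_drv[:, :, None, None]) ** 2).sum(-1)
    weights = torch.softmax(-d2 / (2 * sigma ** 2), dim=1)  # (b, k, h, w)
    # Per-keypoint translation moving driving-frame pixels back to the source.
    shift = (kp_src - kp_drv)[:, :, None, None]  # (b, k, 1, 1, 2)
    flow = grid[None] + (weights[..., None] * shift).sum(dim=1)  # (b, h, w, 2)
    return flow  # sampling grid from the driving frame into the reference image

def warp(reference, flow):
    # Backward-warp the reference image with the dense motion field.
    return F.grid_sample(reference, flow, align_corners=True)

# Usage: warp a 64x64 reference image with 10 keypoint pairs.
ref = torch.randn(1, 3, 64, 64)
kp_s, kp_d = torch.rand(1, 10, 2) * 2 - 1, torch.rand(1, 10, 2) * 2 - 1
out = warp(ref, dense_motion(kp_s, kp_d, 64, 64))
print(out.shape)  # torch.Size([1, 3, 64, 64])
```

Because the same flow field moves face, head, and background pixels together, a warping step of this kind is one way to see why the keypoint-based representation helps constrain spatial and temporal consistency.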