Audio-driven talking head animation is a challenging research topic with many real-world applications. Recent works have focused on creating photo-realistic 2D animation, while learning different talking or singing styles remains an open problem. In this paper, we present a new method to generate talking head animation with learnable style references. Given a set of style reference frames, our framework reconstructs 2D talking head animation from a single input image and an audio stream. Our method first predicts facial landmark motion from the audio stream and constructs intermediate style patterns from the style reference images. We then feed both outputs into a style-aware image generator to produce photo-realistic, high-fidelity 2D animation. In practice, our framework can extract the style information of a specific character and transfer it to any new static image for talking head animation. Extensive experimental results show that our method outperforms recent state-of-the-art approaches both qualitatively and quantitatively.
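To make the dataflow concrete, below is a minimal sketch of the three-stage pipeline the abstract describes: an audio-to-landmark predictor, a style encoder over the reference frames, and a style-aware image generator. All module names, layer choices, and tensor shapes are illustrative assumptions, not the paper's actual architecture.

```python
# A minimal sketch of the described pipeline, under assumed shapes and layers.
# AudioToLandmarks, StyleEncoder, and StyleAwareGenerator are hypothetical
# stand-ins for the paper's components, not the authors' implementation.
import torch
import torch.nn as nn


class AudioToLandmarks(nn.Module):
    """Maps an audio feature sequence to per-frame facial landmark motion."""

    def __init__(self, audio_dim=80, hidden=256, n_landmarks=68):
        super().__init__()
        self.rnn = nn.LSTM(audio_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_landmarks * 2)  # (x, y) per landmark

    def forward(self, audio_feats):  # (B, T, audio_dim)
        h, _ = self.rnn(audio_feats)
        return self.head(h)          # (B, T, n_landmarks * 2)


class StyleEncoder(nn.Module):
    """Summarizes a set of style reference frames into one style code."""

    def __init__(self, style_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, style_dim)

    def forward(self, refs):  # (B, K, 3, H, W): K reference frames
        B, K = refs.shape[:2]
        feats = self.conv(refs.flatten(0, 1)).flatten(1)  # (B*K, 64)
        # Average over the K references to get a single style pattern.
        return self.fc(feats).view(B, K, -1).mean(dim=1)  # (B, style_dim)


class StyleAwareGenerator(nn.Module):
    """Renders a frame from the source image, landmark map, and style code."""

    def __init__(self, style_dim=128):
        super().__init__()
        # Input: source RGB (3) + a rasterized landmark channel (1).
        self.enc = nn.Conv2d(4, 64, 3, 1, 1)
        self.style_to_bias = nn.Linear(style_dim, 64)
        self.dec = nn.Sequential(nn.ReLU(), nn.Conv2d(64, 3, 3, 1, 1), nn.Tanh())

    def forward(self, src, landmark_map, style):
        h = self.enc(torch.cat([src, landmark_map], dim=1))
        # Inject the style code as a per-channel bias (one simple option).
        h = h + self.style_to_bias(style)[:, :, None, None]
        return self.dec(h)


# Usage with dummy tensors: one source image, an audio clip, and 4 references.
audio = torch.randn(1, 50, 80)        # 50 frames of mel-like audio features
refs = torch.randn(1, 4, 3, 64, 64)   # 4 style reference frames
src = torch.randn(1, 3, 64, 64)       # single static input image
lmk_map = torch.randn(1, 1, 64, 64)   # rasterized landmarks for one frame

motion = AudioToLandmarks()(audio)                   # (1, 50, 136)
style = StyleEncoder()(refs)                         # (1, 128)
frame = StyleAwareGenerator()(src, lmk_map, style)   # (1, 3, 64, 64)
```

Because the style code is computed independently of the source image, the same encoder output can be paired with any new static portrait, which is what allows transferring one character's style to another identity.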