We present a novel audio-driven facial animation approach that generates realistic lip-synchronized 3D facial animations from input audio. Our approach learns viseme dynamics from speech videos, produces animator-friendly viseme curves, and supports multilingual speech input. At its core is a novel parametric viseme fitting algorithm that uses phoneme priors to extract viseme parameters from speech videos. With this phoneme guidance, the extracted viseme curves correlate better with phonemes and are therefore more controllable and friendlier to animators. To support multilingual speech input and generalize to unseen voices, we leverage deep audio feature models pretrained on multiple languages to learn the mapping from audio to viseme curves. Our audio-to-curves mapping achieves state-of-the-art performance even when the input audio is distorted in volume, pitch, or speed, or corrupted by noise. Lastly, we present a viseme scanning approach for acquiring high-fidelity viseme assets, enabling efficient speech animation production. We show that the predicted viseme curves can be applied to different viseme-rigged characters to yield diverse personalized animations with realistic and natural facial motion. Our approach is artist-friendly and integrates easily into typical animation production workflows, including blendshape- and bone-based animation.
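To make the audio-to-curves mapping concrete, the sketch below shows one plausible realization under stated assumptions, not the paper's actual implementation: features from a frozen multilingual pretrained speech encoder (here `facebook/wav2vec2-large-xlsr-53`, one example of a model pretrained on many languages) are regressed to per-frame viseme weights by a small temporal head. The number of visemes (16), the GRU head, and the sigmoid output range are all illustrative assumptions.

```python
# A minimal sketch (not the paper's implementation) of an audio-to-viseme-curve
# mapping: a frozen multilingual speech encoder supplies language- and
# speaker-robust features, and a lightweight temporal head predicts one
# activation curve per viseme.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class AudioToVisemeCurves(nn.Module):
    def __init__(self, num_visemes: int = 16, hidden: int = 256):
        super().__init__()
        # Pretrained multilingual encoder, frozen so only the head is trained.
        self.encoder = Wav2Vec2Model.from_pretrained(
            "facebook/wav2vec2-large-xlsr-53")
        for p in self.encoder.parameters():
            p.requires_grad = False
        self.temporal = nn.GRU(self.encoder.config.hidden_size, hidden,
                               batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_visemes)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) at 16 kHz
        # returns:  (batch, frames, num_visemes) viseme curves
        feats = self.encoder(waveform).last_hidden_state
        out, _ = self.temporal(feats)
        # Sigmoid keeps each viseme activation in [0, 1], matching the
        # animator-style curve weights the abstract describes.
        return torch.sigmoid(self.head(out))

model = AudioToVisemeCurves()
curves = model(torch.randn(1, 16000))   # one second of dummy audio
print(curves.shape)                     # e.g. torch.Size([1, 49, 16])
```

Curves in this form can drive any viseme-rigged character, e.g. as blendshape weights or as inputs to a bone-based rig, which is what makes the representation portable across characters.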