Generating natural lip movements synchronized with speech is one of the most important tasks in creating realistic virtual characters. In this paper, we present a deep neural network that combines one-dimensional convolutions and an LSTM to generate vertex displacements of a 3D template face model from variable-length speech input. The resulting motion of the lower face, represented by the vertex movements of the 3D lip shape, is consistent with the input speech. To enhance the robustness of the network to different sound signals, we adapt a pretrained speech recognition model to extract speech features, and we adopt a velocity loss term to reduce jitter in the generated facial animation. We recorded a series of videos of a Chinese adult speaking Mandarin and created a new speech-animation dataset to compensate for the lack of publicly available data of this kind. Qualitative and quantitative evaluations indicate that our model generates smooth, natural lip movements synchronized with speech.
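To make the described pipeline concrete, the following is a minimal sketch of such a model, assuming a PyTorch implementation; the feature dimension, vertex count, channel widths, and loss weighting are illustrative assumptions, not values taken from the paper.

```python
# A minimal sketch (not the authors' released code): 1D convolutions over
# speech features, an LSTM over time, and a linear head that regresses
# per-frame vertex displacements of a template face. All hyperparameters
# below (feat_dim, num_vertices, channel widths, lambda_vel) are assumptions.
import torch
import torch.nn as nn


class SpeechToVertexNet(nn.Module):
    def __init__(self, feat_dim=29, num_vertices=5023, hidden=256):
        super().__init__()
        # Short-range temporal filtering of the speech features
        # (e.g. outputs of a pretrained speech recognition model).
        self.conv = nn.Sequential(
            nn.Conv1d(feat_dim, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # LSTM captures longer-range dependencies over the
        # variable-length input sequence.
        self.lstm = nn.LSTM(128, hidden, batch_first=True)
        # Per-frame vertex displacements relative to the template mesh.
        self.head = nn.Linear(hidden, num_vertices * 3)

    def forward(self, feats):            # feats: (batch, time, feat_dim)
        x = self.conv(feats.transpose(1, 2)).transpose(1, 2)
        x, _ = self.lstm(x)              # (batch, time, hidden)
        return self.head(x)              # (batch, time, num_vertices * 3)


def total_loss(pred, target, lambda_vel=10.0):
    """Position loss plus a velocity term that penalizes frame-to-frame
    jitter; the exact formulation and weighting are assumptions."""
    pos = nn.functional.mse_loss(pred, target)
    vel = nn.functional.mse_loss(pred[:, 1:] - pred[:, :-1],
                                 target[:, 1:] - target[:, :-1])
    return pos + lambda_vel * vel
```

The velocity term matches first differences of the predicted and ground-truth vertex trajectories, so it directly penalizes frame-to-frame jitter rather than per-frame position error alone.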