Learning Disentangled Speech- and Expression-Driven Blendshapes for 3D Talking Face Animation

Expressions are fundamental to conveying human emotions. With the rapid advancement of AI-generated content (AIGC), realistic and expressive 3D facial animation has become increasingly crucial. Despite recent progress in speech-driven lip-sync for talking-face animation, generating emotionally expressive talking faces remains underexplored. A major obstacle is the scarcity of real emotional 3D talking-face datasets due to the high cost of data capture. To address this, we model facial animation driven by both speech and emotion as a linear additive problem. Leveraging a 3D talking-face dataset with neutral expressions (VOCAset) and a dataset of 3D expression sequences (Florence4D), we jointly learn a set of blendshapes driven by speech and emotion. We introduce a sparsity constraint loss to encourage disentanglement between the two types of blendshapes while allowing the model to capture inherent secondary cross-domain deformations present in the training data. The learned blendshapes can be further mapped to the expression and jaw pose parameters of the FLAME model, enabling the animation of 3D Gaussian avatars. Qualitative and quantitative experiments demonstrate that our method naturally generates talking faces with specified expressions while maintaining accurate lip synchronization. Perceptual studies further show that our approach achieves superior emotional expressivity compared to existing methods, without compromising lip-sync quality.
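The linear additive formulation described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the array layouts, and the specific L1 form of the sparsity penalty are all assumptions made for illustration, since the abstract does not give the exact loss or the blendshape parameterization.

```python
import numpy as np


def animate_frame(template, speech_basis, expr_basis, speech_w, expr_w):
    """Compose one frame as template + speech offsets + expression offsets.

    Hypothetical array layouts (not taken from the paper):
      template:     (V, 3)      neutral face mesh vertices
      speech_basis: (Ks, V, 3)  speech-driven blendshape displacements
      expr_basis:   (Ke, V, 3)  expression-driven blendshape displacements
      speech_w:     (Ks,)       per-frame speech blendshape weights
      expr_w:       (Ke,)       per-frame expression blendshape weights
    """
    # Weighted sum over the blendshape axis gives a (V, 3) displacement field.
    speech_offset = np.tensordot(speech_w, speech_basis, axes=1)
    expr_offset = np.tensordot(expr_w, expr_basis, axes=1)
    # Linear additive model: the two displacement fields are simply summed.
    return template + speech_offset + expr_offset


def l1_sparsity(basis):
    """Mean absolute displacement of a blendshape basis.

    Assumed stand-in for a sparsity constraint: it pushes most vertex
    displacements toward zero so each basis stays spatially localized.
    """
    return np.abs(basis).mean()
```

In such a setup, the speech weights would be predicted from audio and the expression weights from the specified emotion. A penalty like `l1_sparsity` would be added to the reconstruction loss for each basis, which discourages overlap between the two sets of blendshapes without forbidding the small cross-domain deformations that the data supports.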
