Speech-driven 3D facial animation has attracted increasing interest due to its potential to generate expressive and temporally synchronized digital humans. While recent works have begun to explore emotion-aware animation, they still rely on explicit one-hot encodings of given emotion and identity labels to represent identity and emotion, which limits their ability to generalize to unseen speakers. Moreover, the emotional cues inherently present in speech are often neglected, limiting the naturalness and adaptability of the generated animations. In this work, we propose LSF-Animation, a novel framework that eliminates the reliance on explicit emotion and identity feature representations. Specifically, LSF-Animation implicitly extracts emotion information from speech and captures identity features from a neutral facial mesh, enabling improved generalization to unseen speakers and emotional states without requiring manual labels. Furthermore, we introduce a Hierarchical Interaction Fusion Block (HIFB), which employs a fusion token to integrate features from dual transformers, fusing emotional, motion-related, and identity-related cues more effectively. Extensive experiments on the 3DMEAD dataset demonstrate that our method surpasses recent state-of-the-art approaches in terms of emotional expressiveness, identity generalization, and animation realism. The source code will be released at: https://github.com/Dogter521/LSF-Animation.
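To make the fusion-token idea concrete, the sketch below shows one plausible reading of such a block: a learnable token cross-attends to two transformer feature streams (a speech/emotion stream and an identity/motion stream) and broadcasts the fused context back onto the motion stream. This is a minimal illustration under assumed dimensions and layer choices, not the paper's actual HIFB implementation; the class and argument names here are hypothetical.

```python
import torch
import torch.nn as nn

class FusionTokenBlockSketch(nn.Module):
    """Hedged sketch of a fusion-token block in the spirit of HIFB.

    A learnable fusion token queries two transformer feature streams
    (e.g., speech-derived emotion features and mesh-derived identity/
    motion features) via cross-attention, then the fused context is
    added back to the motion stream. All names, shapes, and layer
    choices are illustrative assumptions.
    """

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.fusion_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.cross_attn_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, feats_a: torch.Tensor, feats_b: torch.Tensor) -> torch.Tensor:
        # feats_a: (B, Ta, dim) speech/emotion stream
        # feats_b: (B, Tb, dim) identity/motion stream
        B = feats_a.size(0)
        tok = self.fusion_token.expand(B, -1, -1)  # (B, 1, dim)
        # The fusion token gathers context from each stream in turn.
        tok = tok + self.cross_attn_a(tok, feats_a, feats_a)[0]
        tok = tok + self.cross_attn_b(tok, feats_b, feats_b)[0]
        tok = tok + self.mlp(self.norm(tok))
        # Broadcast the fused context back onto the motion stream.
        return feats_b + tok  # (B, Tb, dim) via broadcasting over time


# Example usage with dummy features of unequal lengths:
# fused = FusionTokenBlockSketch()(torch.randn(2, 50, 256), torch.randn(2, 60, 256))
```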