Emotional talking-head synthesis aims to generate talking portrait videos with vivid expressions. Existing methods still exhibit limitations in control flexibility, motion naturalness, and expression quality. Moreover, currently available datasets are mainly collected in lab settings, which further exacerbates these shortcomings and hinders real-world deployment. To address these challenges, we propose EmoCAST, a diffusion-based talking-head framework for precise, text-driven emotional synthesis. Its contributions are threefold: (1) architectural modules that enable effective text control; (2) an emotional talking-head dataset that extends the framework beyond lab-collected data; and (3) training strategies that further improve performance. Specifically, for appearance modeling, emotional prompts are integrated through a text-guided emotive attention module, which enriches the spatial representation to improve emotion understanding. To strengthen audio-emotion alignment, we introduce an emotive audio attention module that captures the interplay between the controlled emotion and the driving audio, producing emotion-aware features that guide precise facial motion synthesis. In addition, we construct a large-scale, in-the-wild emotional talking-head dataset with emotive text descriptions to optimize the framework's performance. Based on this dataset, we propose an emotion-aware sampling strategy and a progressive functional training strategy that improve the model's ability to capture nuanced expressive features and achieve accurate lip-sync. Overall, EmoCAST achieves state-of-the-art performance in generating realistic, emotionally expressive, and audio-synchronized talking-head videos. Project Page: https://github.com/GVCLab/EmoCAST
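To make the conditioning mechanism concrete, the sketch below shows one plausible form of a text-guided emotive attention module: a cross-attention block in which flattened spatial features from the denoising network attend to emotion-prompt token embeddings. This is a minimal illustrative sketch, assuming standard cross-attention; the class name, tensor dimensions, and residual injection are assumptions for illustration, not the released EmoCAST implementation.

```python
import torch
import torch.nn as nn


class TextGuidedEmotiveAttention(nn.Module):
    """Hypothetical cross-attention block: spatial latent features attend to
    emotion-prompt text embeddings (illustrative sketch, not the authors' code)."""

    def __init__(self, feat_dim: int = 320, text_dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(feat_dim)
        # Queries come from spatial features; keys/values from the emotive text prompt.
        self.attn = nn.MultiheadAttention(
            embed_dim=feat_dim, num_heads=num_heads,
            kdim=text_dim, vdim=text_dim, batch_first=True)

    def forward(self, feats: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # feats:    (B, H*W, feat_dim) flattened spatial features from the denoiser
        # text_emb: (B, T, text_dim)   token embeddings of the emotive prompt
        attended, _ = self.attn(self.norm(feats), text_emb, text_emb)
        return feats + attended  # residual injection of emotion-conditioned features


# Toy usage with random tensors standing in for real features and prompt embeddings.
if __name__ == "__main__":
    block = TextGuidedEmotiveAttention()
    feats = torch.randn(2, 64 * 64, 320)
    text_emb = torch.randn(2, 77, 768)
    print(block(feats, text_emb).shape)  # torch.Size([2, 4096, 320])
```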