EmoCAST: Emotional Talking Portrait via Emotive Text Description

Emotional talking head synthesis aims to generate talking portrait videos with vivid expressions. Existing methods still exhibit limitations in control flexibility, motion naturalness, and expression quality. Moreover, currently available datasets are mainly collected in lab settings, further exacerbating these shortcomings and hindering real-world deployment. To address these challenges, we propose EmoCAST, a diffusion-based talking head framework for precise, text-driven emotional synthesis. Its contributions are threefold: (1) architectural modules that enable effective text control; (2) an emotional talking-head dataset that expands the framework's ability; and (3) training strategies that further improve performance. Specifically, for appearance modeling, emotional prompts are integrated through a text-guided emotive attention module, enhancing spatial knowledge to improve emotion understanding. To strengthen audio-emotion alignment, we introduce an emotive audio attention module to capture the interplay between controlled emotion and driving audio, generating emotion-aware features to guide precise facial motion synthesis. Additionally, we construct a large-scale, in-the-wild emotional talking head dataset with emotive text descriptions to optimize the framework's performance. Based on this dataset, we propose an emotion-aware sampling strategy and a progressive functional training strategy that improve the model's ability to capture nuanced expressive features and achieve accurate lip-sync. Overall, EmoCAST achieves state-of-the-art performance in generating realistic, emotionally expressive, and audio-synchronized talking-head videos. Project Page: https://github.com/GVCLab/EmoCAST
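The abstract does not specify how the text-guided emotive attention module is built. As a rough illustration only, the sketch below assumes it behaves like a standard cross-attention layer: flattened spatial features act as queries, emotion-prompt token embeddings act as keys and values, and the attended result is added back residually. The function names, shapes, and random projection matrices are all hypothetical and are not taken from the EmoCAST code.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def cross_attention(spatial, text, w_q, w_k, w_v):
    """Hypothetical text-guided attention: spatial queries attend to text tokens.

    spatial: (N x d) flattened spatial features
    text:    (T x d_t) emotion-prompt token embeddings
    Returns the residually updated spatial features and the (N x T) weights.
    """
    q = matmul(spatial, w_q)                      # (N x d_a)
    k = matmul(text, w_k)                         # (T x d_a)
    v = matmul(text, w_v)                         # (T x d)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]          # (d_a x T)
    scores = matmul(q, k_t)                       # (N x T)
    weights = [softmax([s / scale for s in row]) for row in scores]
    attended = matmul(weights, v)                 # (N x d)
    out = [[s + a for s, a in zip(s_row, a_row)]
           for s_row, a_row in zip(spatial, attended)]
    return out, weights


def random_matrix(rows, cols, rng):
    """Stand-in for learned parameters or encoder outputs (assumption)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
spatial = random_matrix(6, 4, rng)   # 6 spatial positions, 4 channels
text = random_matrix(3, 5, rng)      # 3 prompt tokens, 5-dim embeddings
w_q = random_matrix(4, 2, rng)
w_k = random_matrix(5, 2, rng)
w_v = random_matrix(5, 4, rng)
out, weights = cross_attention(spatial, text, w_q, w_k, w_v)
```

Each row of `weights` sums to 1 and indicates how strongly one spatial position draws on each prompt token. The emotive audio attention module described in the abstract could plausibly follow the same pattern with audio features in place of, or alongside, the text tokens, but that is likewise an assumption.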

