Generative models have advanced rapidly, enabling impressive talking head generation that brings AI to life. However, most existing methods focus solely on one-way portrait animation, and even the few that support bidirectional conversational interaction lack precise emotion-adaptive capabilities, significantly limiting their practical applicability. In this paper, we propose Warm Chat, a novel emotion-aware talking head generation framework for dyadic interactions. Leveraging the dialogue generation capability of large language models (LLMs, e.g., GPT-4), our method produces temporally consistent virtual avatars with rich emotional variation that transition seamlessly between speaking and listening states. Specifically, we design a Transformer-based head mask generator that learns temporally consistent motion features in a latent mask space and can generate mask sequences of arbitrary length to constrain head motion. Furthermore, we introduce an interactive talking tree structure to represent dialogue state transitions, where each tree node stores links to its child, parent, and sibling nodes together with the current character's emotional state. By performing a reverse level-order traversal from the current node, we extract rich historical emotional cues to guide expression synthesis. Extensive experiments demonstrate the superior performance and effectiveness of our method.
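The interactive talking tree described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the node fields (`speaker`, `emotion`), the helper names, and the exact traversal order are assumptions inferred from the abstract, which leaves the precise definition of "reverse-level traversal" open — here it is read as walking from the current node back toward the root, collecting emotion cues level by level.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TalkingTreeNode:
    # Hypothetical node layout: each node keeps links to its parent and
    # children (siblings are derived) plus the current character's emotion.
    speaker: str
    emotion: str
    parent: Optional["TalkingTreeNode"] = None
    children: List["TalkingTreeNode"] = field(default_factory=list)

    def add_child(self, speaker: str, emotion: str) -> "TalkingTreeNode":
        """Append a dialogue turn as a child node and return it."""
        child = TalkingTreeNode(speaker, emotion, parent=self)
        self.children.append(child)
        return child

    @property
    def siblings(self) -> List["TalkingTreeNode"]:
        if self.parent is None:
            return []
        return [c for c in self.parent.children if c is not self]

def historical_emotion_cues(node: TalkingTreeNode) -> List[str]:
    """One reading of 'reverse-level traversal': starting from the current
    node, visit each level back up to the root, collecting the node's own
    emotion, its siblings' emotions, then its ancestors'."""
    cues: List[str] = []
    cur: Optional[TalkingTreeNode] = node
    while cur is not None:
        cues.append(cur.emotion)
        cues.extend(s.emotion for s in cur.siblings)
        cur = cur.parent
    return cues

# Toy dialogue: A greets happily, B replies calmly, A answers excitedly.
root = TalkingTreeNode("A", "happy")
b = root.add_child("B", "calm")
a2 = b.add_child("A", "excited")
print(historical_emotion_cues(a2))  # ['excited', 'calm', 'happy']
```

The collected cue list would then condition expression synthesis for the current turn; in the actual framework the cues would presumably be embeddings rather than emotion labels.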