Warm Chat: Diffuse Emotion-aware Interactive Talking Head Avatar with Tree-Structured Guidance

Generative models have advanced rapidly, enabling impressive talking head generation that brings AI to life. However, most existing methods focus solely on one-way portrait animation. Even the few that support bidirectional conversational interaction lack precise emotion-adaptive capabilities, which significantly limits their practical applicability. In this paper, we propose Warm Chat, a novel emotion-aware talking head generation framework for dyadic interactions. Leveraging the dialogue generation capability of large language models (LLMs, e.g., GPT-4), our method produces temporally consistent virtual avatars with rich emotional variations that seamlessly transition between speaking and listening states. Specifically, we design a Transformer-based head mask generator that learns temporally consistent motion features in a latent mask space and can generate mask sequences of arbitrary length to constrain head motion. Furthermore, we introduce an interactive talking tree structure to represent dialogue state transitions, where each tree node stores its child, parent, and sibling nodes together with the current character's emotional state. By performing a reverse-level traversal starting from the current node, we extract rich historical emotional cues to guide expression synthesis. Extensive experiments demonstrate the superior performance and effectiveness of our method.
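
To make the interactive talking tree more concrete, the following is a minimal Python sketch of one way such a structure could look. It is not the paper's implementation: the class `DialogueNode`, its field names, and the function `emotion_history` are hypothetical, and "reverse-level traversal" is interpreted here as walking from the current node back toward the root, one level at a time, collecting the emotional state stored at each level.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(eq=False)
class DialogueNode:
    """One dialogue turn. All field names are illustrative assumptions."""
    speaker: str
    state: str    # e.g. "speaking" or "listening"
    emotion: str  # emotional state of the current character
    parent: Optional["DialogueNode"] = field(default=None, repr=False)
    children: List["DialogueNode"] = field(default_factory=list)

    def add_child(self, speaker: str, state: str, emotion: str) -> "DialogueNode":
        """Attach a follow-up turn below this node and return it."""
        child = DialogueNode(speaker, state, emotion, parent=self)
        self.children.append(child)
        return child

    def siblings(self) -> List["DialogueNode"]:
        """Other nodes that share this node's parent."""
        if self.parent is None:
            return []
        return [n for n in self.parent.children if n is not self]


def emotion_history(
    node: DialogueNode, max_levels: Optional[int] = None
) -> List[Tuple[str, str]]:
    """Walk from `node` up to the root, most recent turn first.

    Returns (speaker, emotion) pairs, optionally capped at `max_levels` levels.
    """
    history: List[Tuple[str, str]] = []
    current: Optional[DialogueNode] = node
    while current is not None:
        if max_levels is not None and len(history) >= max_levels:
            break
        history.append((current.speaker, current.emotion))
        current = current.parent
    return history


if __name__ == "__main__":
    root = DialogueNode("user", "speaking", "neutral")
    reply = root.add_child("avatar", "listening", "concerned")
    turn = reply.add_child("avatar", "speaking", "warm")
    print(emotion_history(turn))
    # [('avatar', 'warm'), ('avatar', 'concerned'), ('user', 'neutral')]
```

In this sketch, the list returned by `emotion_history` stands in for the historical emotional cues that would condition expression synthesis; how the actual system encodes and consumes those cues is not specified in the abstract.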
