Deploying massive diffusion models for real-time, infinite-duration, audio-driven avatar generation presents a significant engineering challenge, primarily due to the conflict between computational load and strict latency constraints. Existing approaches often compromise visual fidelity by enforcing strictly unidirectional attention mechanisms or by reducing model capacity. To address this problem, we introduce \textbf{SoulX-LiveTalk}, a 14B-parameter framework optimized for high-fidelity real-time streaming. Diverging from conventional unidirectional paradigms, we adopt a \textbf{Self-correcting Bidirectional Distillation} strategy that retains bidirectional attention within video chunks. This design preserves critical spatiotemporal correlations, significantly enhancing motion coherence and visual detail. To ensure stability during infinite-duration generation, we incorporate a \textbf{Multi-step Retrospective Self-Correction Mechanism} that enables the model to autonomously recover from accumulated errors and prevents collapse. Furthermore, we engineer a full-stack inference acceleration suite incorporating hybrid sequence parallelism, a Parallel VAE, and kernel-level optimizations. Extensive evaluations confirm that SoulX-LiveTalk is the first 14B-scale system to achieve \textbf{sub-second start-up latency (0.87s)} while sustaining a real-time throughput of \textbf{32 FPS}, setting a new standard for high-fidelity interactive digital human synthesis.
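For concreteness, the sketch below is a minimal illustration (our own assumption, not the released SoulX-LiveTalk implementation) of the chunk-wise attention pattern referenced above: attention is bidirectional within each video chunk and causal across chunks. The function name, chunk size, and PyTorch layout are hypothetical and chosen purely for illustration.
\begin{verbatim}
# Minimal sketch (illustrative assumption): a frame-level attention mask that
# is bidirectional *within* each video chunk but causal *across* chunks.
import torch

def chunkwise_bidirectional_mask(num_frames: int, chunk_size: int) -> torch.Tensor:
    """Return a boolean mask of shape (num_frames, num_frames).

    mask[i, j] is True when frame i may attend to frame j:
    - frames in the same chunk attend to each other in both directions;
    - frames attend to all frames in earlier chunks, never to later chunks.
    """
    chunk_id = torch.arange(num_frames) // chunk_size               # chunk index per frame
    same_chunk = chunk_id.unsqueeze(1) == chunk_id.unsqueeze(0)     # bidirectional block
    earlier_chunk = chunk_id.unsqueeze(1) > chunk_id.unsqueeze(0)   # causal across chunks
    return same_chunk | earlier_chunk

if __name__ == "__main__":
    mask = chunkwise_bidirectional_mask(num_frames=8, chunk_size=4)
    print(mask.int())  # two dense 4x4 blocks; the second chunk also sees the first
\end{verbatim}
Under this assumed pattern, restricting cross-chunk attention to earlier chunks is what permits streaming generation of unbounded length, while the dense within-chunk block is what distinguishes it from a strictly unidirectional mask.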