DMP-TTS: Disentangled Multi-modal Prompting for Controllable Text-to-Speech with Chained Guidance

Controllable text-to-speech (TTS) systems face significant challenges in achieving independent manipulation of speaker timbre and speaking style, often suffering from entanglement between these attributes. We present DMP-TTS, a latent Diffusion Transformer (DiT) framework with explicit disentanglement and multi-modal prompting. A CLAP-based style encoder (Style-CLAP) aligns cues from reference audio and descriptive text in a shared space and is trained with contrastive learning plus multi-task supervision on style attributes. For fine-grained control during inference, we introduce chained classifier-free guidance (cCFG) trained with hierarchical condition dropout, enabling independent adjustment of content, timbre, and style guidance strengths. Additionally, we employ Representation Alignment (REPA) to distill acoustic-semantic features from a pretrained Whisper model into intermediate DiT representations, stabilizing training and accelerating convergence. Experiments show that DMP-TTS delivers stronger style controllability than open-source baselines while maintaining competitive intelligibility and naturalness. Code and demos will be available at https://y61329697.github.io/DMP-TTS/.
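The abstract does not state the cCFG formula, so the following is only a minimal sketch of one plausible reading: conditions are dropped from the end of the chain (style, then timbre, then content) during training, and at inference each link of the chain gets its own guidance weight. The function names, the dropout probabilities, and the exact combination rule are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def hierarchical_dropout(rng, p_uncond=0.1, p_content_only=0.1, p_content_timbre=0.1):
    """Pick which conditions to keep for one training sample.

    Returns (keep_content, keep_timbre, keep_style). Conditions are dropped
    from the end of the chain, so style is never kept without timbre and
    timbre is never kept without content. Probabilities are hypothetical.
    """
    u = rng.random()
    if u < p_uncond:
        return (False, False, False)
    if u < p_uncond + p_content_only:
        return (True, False, False)
    if u < p_uncond + p_content_only + p_content_timbre:
        return (True, True, False)
    return (True, True, True)


def chained_cfg(pred_none, pred_c, pred_ct, pred_cts, w_content, w_timbre, w_style):
    """Combine four model predictions with one weight per chain link.

    pred_none: prediction with no condition
    pred_c:    prediction with content only
    pred_ct:   prediction with content + timbre
    pred_cts:  prediction with content + timbre + style
    (Assumed nested form; each weight scales only its own increment.)
    """
    return (
        pred_none
        + w_content * (pred_c - pred_none)
        + w_timbre * (pred_ct - pred_c)
        + w_style * (pred_cts - pred_ct)
    )


# Usage with random arrays standing in for DiT outputs.
rng = np.random.default_rng(0)
p0, p1, p2, p3 = (rng.normal(size=(4, 8)) for _ in range(4))
guided = chained_cfg(p0, p1, p2, p3, w_content=2.0, w_timbre=1.5, w_style=3.0)
print(guided.shape)
```

In this form, setting all three weights to 1 recovers the fully conditioned prediction, and raising only `w_style` strengthens style guidance without rescaling the content or timbre increments, which is the kind of independent adjustment the abstract describes.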

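The Representation Alignment (REPA) step mentioned in the abstract can be illustrated with a small sketch. The abstract gives no loss formula, so this assumes a common form of representation alignment: intermediate hidden states are projected into the teacher's feature space and pulled toward the teacher features by maximizing per-frame cosine similarity. The linear projection, the frame-level pairing, and the random arrays standing in for DiT hidden states and Whisper features are all assumptions.

```python
import numpy as np


def repa_loss(hidden, target, proj):
    """Negative mean cosine similarity between projected hidden states and targets.

    hidden: (T, d_model)  intermediate hidden states (stand-in for DiT features)
    target: (T, d_target) frozen teacher features (stand-in for Whisper features)
    proj:   (d_model, d_target) projection matrix (hypothetical linear head)
    """
    eps = 1e-8
    z = hidden @ proj
    z_unit = z / (np.linalg.norm(z, axis=-1, keepdims=True) + eps)
    t_unit = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    return float(-(z_unit * t_unit).sum(axis=-1).mean())


# Usage with random stand-in features: 6 frames, d_model=4, d_target=3.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(6, 4))
proj = rng.normal(size=(4, 3))
target = rng.normal(size=(6, 3))
print(repa_loss(hidden, target, proj))
```

The loss lies in [-1, 1] and reaches its minimum when every projected frame points in the same direction as its teacher frame. In training it would typically be added to the diffusion objective with a weighting coefficient, which is not specified in the abstract.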

