Text-to-Speech (TTS) diffusion models generate high-quality speech, which raises challenges for protecting model intellectual property and for tracing generated speech for legal use. Audio watermarking is a promising solution. However, because TTS diffusion models differ structurally, existing watermarking methods are often designed for a specific model and degrade audio quality, which limits their practical applicability. To address this limitation, this paper proposes Smark, a universal watermarking scheme for TTS diffusion models. Smark is built on a lightweight watermark embedding framework that operates in the reverse diffusion paradigm shared by all TTS diffusion models. To mitigate the impact on audio quality, Smark uses the discrete wavelet transform (DWT) to embed watermarks into the relatively stable low-frequency regions of the audio, which ensures seamless watermark-audio integration and makes the watermark resistant to removal during the reverse diffusion process. Extensive experiments evaluate audio quality and watermark performance under various simulated real-world attack scenarios. The results show that Smark achieves superior performance in both audio quality and watermark extraction accuracy.
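The idea of embedding a watermark in the low-frequency DWT band can be illustrated with a minimal sketch. This is not Smark's actual embedding procedure, which the abstract does not specify and which operates inside the reverse diffusion process. The sketch assumes a hand-written single-level Haar transform applied directly to a waveform, and it swaps in quantization index modulation (QIM) as a generic embedding rule. The function names (`haar_dwt`, `haar_idwt`, `embed_bits`, `extract_bits`) and the `step` parameter are hypothetical.

```python
import numpy as np

def haar_dwt(x):
    """Single-level Haar DWT: return (approximation, detail) coefficients."""
    x = np.asarray(x, dtype=np.float64)
    if len(x) % 2:
        raise ValueError("signal length must be even")
    even, odd = x[0::2], x[1::2]
    return (even + odd) / np.sqrt(2.0), (even - odd) / np.sqrt(2.0)

def haar_idwt(approx, detail):
    """Inverse of haar_dwt: rebuild the signal from both coefficient bands."""
    x = np.empty(2 * len(approx))
    x[0::2] = (approx + detail) / np.sqrt(2.0)
    x[1::2] = (approx - detail) / np.sqrt(2.0)
    return x

def embed_bits(audio, bits, step=0.05):
    """Assumed QIM rule: snap the i-th low-frequency coefficient to a
    multiple of `step` whose parity (even/odd) equals the i-th bit."""
    approx, detail = haar_dwt(audio)
    if len(bits) > len(approx):
        raise ValueError("too many bits for this signal length")
    marked = approx.copy()
    for i, bit in enumerate(bits):
        q = np.round(approx[i] / step)
        if int(q) % 2 != bit:
            # Wrong parity: move to the neighbouring multiple on the side
            # where the original coefficient lies, to limit distortion.
            q += 1 if approx[i] / step >= q else -1
        marked[i] = q * step
    # The detail (high-frequency) band is left untouched.
    return haar_idwt(marked, detail)

def extract_bits(audio, n_bits, step=0.05):
    """Read each bit back as the parity of the quantized coefficient."""
    approx, _ = haar_dwt(audio)
    return [int(np.round(approx[i] / step)) % 2 for i in range(n_bits)]

# Usage on a synthetic signal (a stand-in for real audio samples).
rng = np.random.default_rng(0)
audio = rng.uniform(-1.0, 1.0, 64)
bits = [1, 0, 1, 1, 0, 0, 1, 0]
marked = embed_bits(audio, bits)
recovered = extract_bits(marked, len(bits))
```

In this sketch a larger `step` tolerates more perturbation before bits flip, at the cost of a larger change to the waveform, which mirrors the quality-versus-robustness trade-off the abstract refers to.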