The rapid advancement of next-token-prediction models has led to widespread adoption across modalities and has made realistic synthetic media easy to create. In the audio domain, autoregressive speech models have advanced conversational interaction, but they have also increased the potential for misuse, such as impersonation in phishing schemes or the crafting of misleading speech recordings. Security measures such as watermarking have thus become essential to ensuring the authenticity of digital media. Traditional statistical watermarking methods designed for autoregressive language models face challenges when applied to autoregressive audio models because of the inevitable "retokenization mismatch": the discrepancy between the original discrete audio token sequence and its retokenized counterpart. To address this, we introduce Aligned-IS, a novel distortion-free watermark designed specifically for audio generation models. The technique clusters tokens and treats all tokens within the same cluster as equivalent, which counters the retokenization mismatch. Comprehensive experiments on prevalent audio generation platforms show that Aligned-IS preserves the quality of generated audio and significantly improves watermark detectability compared with state-of-the-art distortion-free watermarking adaptations, establishing a new benchmark for secure audio technology applications.
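The cluster-equivalence idea can be illustrated with a toy sketch. This is not the Aligned-IS implementation: the abstract does not say how clusters are built or how the detection score is computed. The token-to-cluster map, the two token sequences, and the helper names `to_clusters` and `match_rate` are all hypothetical, chosen only to show why comparing cluster IDs tolerates a retokenization mismatch that comparing raw token IDs does not.

```python
from typing import Dict, List, Sequence

# Hypothetical codebook: 6 audio tokens grouped into 3 clusters.
# (Assumption for illustration; real clusters would come from the model's codebook.)
TOKEN_TO_CLUSTER: Dict[int, int] = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}


def to_clusters(tokens: Sequence[int], mapping: Dict[int, int]) -> List[int]:
    """Replace each token ID with the ID of the cluster it belongs to."""
    return [mapping[t] for t in tokens]


def match_rate(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Tokens sampled by the model, and the tokens recovered after retokenization.
# Both sequences are made up; the second swaps some tokens for cluster neighbours.
generated = [0, 2, 5, 3, 1, 4]
retokenized = [1, 2, 4, 3, 0, 4]

token_match = match_rate(generated, retokenized)
cluster_match = match_rate(
    to_clusters(generated, TOKEN_TO_CLUSTER),
    to_clusters(retokenized, TOKEN_TO_CLUSTER),
)

print(f"token-level agreement:   {token_match:.2f}")
print(f"cluster-level agreement: {cluster_match:.2f}")
```

In this toy case half of the token IDs change under retokenization, yet every position stays inside its original cluster, so a signal defined over clusters would survive where a signal defined over individual tokens would be partly lost.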