This paper presents a lightweight text-to-speech (TTS) system developed for the WildSpoof Challenge TTS Track. Our approach fine-tunes the recently released open-weight TTS model, \textit{Supertonic}\footnote{\url{https://github.com/supertone-inc/supertonic}}, with Self-Purifying Flow Matching (SPFM) to enable robust adaptation to in-the-wild speech. SPFM mitigates label noise by comparing conditional and unconditional flow matching losses on each sample, routing suspicious text--speech pairs to unconditional training while still leveraging their acoustic information. The resulting model achieves the lowest Word Error Rate (WER) among all participating teams, while ranking second in perceptual metrics such as UTMOS and DNSMOS. These findings demonstrate that efficient, open-weight architectures like Supertonic can be effectively adapted to diverse real-world speech conditions when combined with explicit noise-handling mechanisms such as SPFM.
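As an illustration of the per-sample routing described above, the following is a minimal NumPy sketch, not the implementation used in our system. It makes three assumptions that the text does not fix: a straight-line probability path with velocity target $x_1 - x_0$, a routing rule that flags a pair as suspicious when its conditional loss exceeds its unconditional loss, and the hypothetical helper names \texttt{fm\_loss} and \texttt{spfm\_select}.

```python
import numpy as np


def fm_loss(pred_velocity, x0, x1):
    """Per-sample flow matching loss.

    Assumption: a straight-line path from noise x0 to data x1, so the
    regression target for the velocity field is x1 - x0.
    """
    target = x1 - x0
    feature_axes = tuple(range(1, target.ndim))  # average over all non-batch axes
    return np.mean((pred_velocity - target) ** 2, axis=feature_axes)


def spfm_select(cond_loss, uncond_loss):
    """Route each sample to conditional or unconditional training.

    Assumed rule: if the text-conditioned loss is larger than the
    unconditional loss, the transcript is not helping the model explain
    the audio, so the pair is treated as suspicious and trained without text.
    Returns (use_text, batch_loss), where use_text[i] is False for routed samples.
    """
    cond_loss = np.asarray(cond_loss, dtype=float)
    uncond_loss = np.asarray(uncond_loss, dtype=float)
    use_text = cond_loss <= uncond_loss
    # Suspicious samples still contribute their unconditional loss,
    # so their acoustic information is not discarded.
    per_sample = np.where(use_text, cond_loss, uncond_loss)
    return use_text, per_sample.mean()
```

In a training loop, the two loss vectors would come from two forward passes on the same noisy input, one with the text condition and one with it dropped; only the selection step is shown here.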