There has been significant progress in text-to-speech (TTS) synthesis in recent years, thanks to advances in neural generative modeling. However, existing methods for any-speaker adaptive TTS achieve unsatisfactory performance, due to their suboptimal accuracy in mimicking the target speakers' styles. In this work, we present Grad-StyleSpeech, an any-speaker adaptive TTS framework based on a diffusion model, which generates highly natural speech with extremely high similarity to the target speaker's voice given only a few seconds of reference speech. Grad-StyleSpeech significantly outperforms recent speaker-adaptive TTS baselines on English benchmarks. Audio samples are available at https://nardien.github.io/grad-stylespeech-demo.
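To make the diffusion-based generation concrete, below is a minimal sketch of a score-based reverse-diffusion sampler that produces a mel-spectrogram conditioned on a style embedding extracted from reference speech, in the style of Grad-TTS-like models. The function `sample_mel`, the callable `score_net`, the tensor shapes, and the linear beta schedule are all illustrative assumptions, not the paper's actual implementation.

```python
import torch

def sample_mel(score_net, mu, style, n_steps=50, beta_min=0.05, beta_max=20.0):
    """Reverse-diffusion sampling of a mel-spectrogram (Grad-TTS-style SDE).

    mu:    text-conditioned prior mean, shape (B, n_mel, T)
    style: speaker embedding from a few seconds of reference speech, shape (B, d)
    """
    x = mu + torch.randn_like(mu)   # start from the terminal prior N(mu, I)
    h = 1.0 / n_steps               # step size over t in (0, 1]
    for i in range(n_steps):
        t = 1.0 - (i + 0.5) * h     # midpoint time, integrating from t=1 down to t=0
        beta_t = beta_min + t * (beta_max - beta_min)   # assumed linear noise schedule
        t_batch = torch.full((x.size(0),), t, device=x.device)
        # score_net approximates the score: grad_x log p_t(x | text, style)
        score = score_net(x, mu, style, t_batch)
        # one Euler step of the probability-flow ODE, backward in time
        x = x - 0.5 * beta_t * (mu - x - score) * h
    return x
```

The key design point this sketch illustrates is that speaker adaptation enters only through the conditioning inputs (`mu` from the text encoder and the `style` embedding), so no per-speaker fine-tuning of the score network is required at inference time.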