We introduce a technique for augmenting neural text-to-speech (TTS) with low-dimensional trainable speaker embeddings to generate different voices from a single model. As a starting point, we show improvements over the two state-of-the-art approaches for single-speaker neural TTS: Deep Voice 1 and Tacotron. We introduce Deep Voice 2, which is based on a pipeline similar to that of Deep Voice 1, but constructed with higher-performance building blocks, and demonstrates a significant audio quality improvement over Deep Voice 1. We improve Tacotron by introducing a post-processing neural vocoder, and demonstrate a significant audio quality improvement. We then demonstrate our technique for multi-speaker speech synthesis for both Deep Voice 2 and Tacotron on two multi-speaker TTS datasets. We show that a single neural TTS system can learn hundreds of unique voices from less than half an hour of data per speaker, while achieving high audio quality synthesis and preserving the speaker identities almost perfectly.
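To make the core idea concrete, the sketch below shows one common way a trainable speaker embedding can condition a neural synthesis model: a per-speaker vector is learned jointly with the network and injected into the decoder. This is a minimal illustration, not the paper's architecture; the class name, layer choices, and dimensions are all hypothetical.

```python
import torch
import torch.nn as nn

class SpeakerConditionedDecoder(nn.Module):
    """Minimal sketch: a recurrent decoder conditioned on a trainable,
    low-dimensional speaker embedding (one learned vector per speaker)."""

    def __init__(self, num_speakers, speaker_dim=16, frame_dim=80, hidden_dim=256):
        super().__init__()
        # One trainable embedding per speaker, learned jointly with the model.
        self.speaker_embedding = nn.Embedding(num_speakers, speaker_dim)
        # Project the speaker vector to the decoder's initial hidden state.
        self.init_state = nn.Linear(speaker_dim, hidden_dim)
        self.rnn = nn.GRU(frame_dim + speaker_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, frame_dim)

    def forward(self, frames, speaker_ids):
        # frames: (batch, time, frame_dim); speaker_ids: (batch,)
        s = self.speaker_embedding(speaker_ids)           # (batch, speaker_dim)
        h0 = torch.tanh(self.init_state(s)).unsqueeze(0)  # (1, batch, hidden_dim)
        # Concatenate the speaker vector to every input frame so the whole
        # sequence is conditioned on speaker identity.
        s_time = s.unsqueeze(1).expand(-1, frames.size(1), -1)
        y, _ = self.rnn(torch.cat([frames, s_time], dim=-1), h0)
        return self.out(y)

# Usage: one model, many voices -- only speaker_ids changes between speakers.
model = SpeakerConditionedDecoder(num_speakers=108)
mel = torch.randn(2, 50, 80)
out = model(mel, torch.tensor([3, 42]))  # (2, 50, 80)
```

Because the speaker vectors are low-dimensional and trained by backpropagation alongside the rest of the network, each speaker's identity is captured in a compact vector while all other parameters are shared across speakers.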