Recent advances in text-to-speech (TTS) synthesis, such as Tacotron and WaveRNN, have made it possible to construct a fully neural-network-based TTS system by coupling the two components together. Such a system is conceptually simple: it takes only grapheme or phoneme input, uses the Mel spectrogram as an intermediate feature, and directly generates speech samples. The system achieves quality equal or close to that of natural speech. However, its high computational cost and issues with robustness have limited its use in real-world speech synthesis applications and products. In this paper, we present key modeling improvements and optimization strategies that enable deploying these models, not only on GPU servers, but also on mobile devices. The proposed system can generate high-quality 24 kHz speech 5x faster than real time on a server and 3x faster than real time on mobile devices.
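To make the two-stage structure concrete, below is a minimal Python sketch of such a coupled pipeline. The class names, interfaces, and all shape parameters here are hypothetical placeholders chosen for illustration; they are not the paper's actual implementation, and the stand-in models only mimic the input/output contract of a Tacotron-style acoustic model and a WaveRNN-style vocoder.

```python
import numpy as np

SAMPLE_RATE = 24_000   # 24 kHz output, as stated in the abstract
MEL_CHANNELS = 80      # a common Mel-spectrogram size (assumption)
HOP_LENGTH = 300       # hypothetical frame hop: 12.5 ms at 24 kHz

class AcousticModel:
    """Tacotron-style stand-in: maps a phoneme sequence to Mel frames.
    A real model is autoregressive with attention; this stub only
    reproduces the input/output shapes."""
    def synthesize_mel(self, phonemes: list[str]) -> np.ndarray:
        n_frames = 10 * len(phonemes)          # dummy duration model
        return np.random.randn(n_frames, MEL_CHANNELS).astype(np.float32)

class Vocoder:
    """WaveRNN-style stand-in: maps Mel frames to waveform samples,
    producing one hop of audio per Mel frame."""
    def synthesize_audio(self, mel: np.ndarray) -> np.ndarray:
        n_samples = mel.shape[0] * HOP_LENGTH
        return np.random.randn(n_samples).astype(np.float32)

def tts(text: str, acoustic: AcousticModel, vocoder: Vocoder) -> np.ndarray:
    """Couple the two components: text -> Mel frames -> speech samples."""
    phonemes = list(text)                      # trivial G2P placeholder
    mel = acoustic.synthesize_mel(phonemes)    # stage 1: symbols to Mel
    return vocoder.synthesize_audio(mel)       # stage 2: Mel to waveform

audio = tts("hello", AcousticModel(), Vocoder())
print(f"{audio.size / SAMPLE_RATE:.2f} s of audio")
```

The key design point the abstract highlights is that the Mel spectrogram is the only interface between the two models, which is what makes the coupled system conceptually simple and end-to-end trainable from grapheme or phoneme input.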