Incremental text-to-speech, also known as streaming TTS, has been increasingly applied to online speech applications that require ultra-low response latency to provide an optimal user experience. However, most existing speech synthesis pipelines deployed on GPUs are still non-incremental, which exposes their limitations in high-concurrency scenarios, especially when the pipeline is built from end-to-end neural network models. To address this issue, we present a highly efficient approach to performing real-time incremental TTS on GPUs with Instant Request Pooling and Module-wise Dynamic Batching. Experimental results demonstrate that the proposed method can produce high-quality speech with a first-chunk latency below 80 ms under 100 QPS on a single NVIDIA A10 GPU, and that it significantly outperforms its non-incremental counterpart in both concurrency and latency. Our work demonstrates the effectiveness of high-performance incremental TTS on GPUs.
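The abstract names two scheduling ideas, Instant Request Pooling and Module-wise Dynamic Batching, without detailing them. Below is a minimal, hypothetical sketch of how such a scheduler could be organized: requests join a shared pool the moment they arrive, and at each synthesis step a fresh batch is rebuilt from all active requests before each pipeline module runs. The class and module names (`IncrementalScheduler`, `acoustic_model`, `vocoder`) are illustrative assumptions, not the paper's actual implementation.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    """A streaming TTS request; chunks_left counts speech chunks still to synthesize."""
    req_id: int
    chunks_left: int


class IncrementalScheduler:
    """Hypothetical sketch of instant request pooling + module-wise dynamic batching."""

    def __init__(self):
        self.pool = deque()

    def submit(self, request: Request) -> None:
        # Instant request pooling: a new request joins the pool immediately,
        # without waiting for the current batch to finish.
        self.pool.append(request)

    def step(self):
        # Module-wise dynamic batching: rebuild the batch at every step from
        # whatever is currently pooled, so late arrivals are served at once.
        batch = [r for r in self.pool if r.chunks_left > 0]
        for module in ("acoustic_model", "vocoder"):  # placeholder module names
            pass  # each module would process `batch` as one batched GPU call
        finished = []
        for r in batch:
            r.chunks_left -= 1
            if r.chunks_left == 0:
                finished.append(r.req_id)
                self.pool.remove(r)
        return [r.req_id for r in batch], finished
```

In this sketch, a request arriving mid-stream is batched on the very next step rather than queued behind a fixed batch, which is the property that keeps first-chunk latency low under high concurrency:

```python
sched = IncrementalScheduler()
sched.submit(Request(1, 2))
batch, done = sched.step()   # batch [1], request 1 still streaming
sched.submit(Request(2, 1))  # joins the pool mid-stream
batch, done = sched.step()   # batch [1, 2], both finish this step
```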