Typical high-quality text-to-speech (TTS) systems today use a two-stage architecture, with a spectrum model stage that generates spectral frames and a vocoder stage that generates the actual audio. High-quality spectrum models usually incorporate an encoder-decoder architecture with self-attention or bi-directional long short-term memory (BLSTM) units. While these models can produce high-quality speech, they often incur an O($L$) increase in both latency and real-time factor (RTF) with respect to input length $L$. In other words, longer inputs lead to longer delays and slower synthesis, limiting their use in real-time applications. In this paper, we propose a multi-rate attention architecture that breaks the latency and RTF bottlenecks by computing a compact representation during encoding and recurrently generating the attention vector in a streaming manner during decoding. The proposed architecture achieves high audio quality (MOS of 4.31, compared to 4.48 for ground truth), low latency, and low RTF at the same time. Moreover, both latency and RTF of the proposed system stay constant regardless of input length, making it ideal for real-time applications.
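The constant per-step cost follows from attending over a fixed-size summary of the input rather than all $L$ encoder frames. A minimal NumPy sketch of this idea, where the mean-pooling into a fixed number of slots is an illustrative assumption and not the paper's actual multi-rate encoder:

```python
import numpy as np

def compact_encode(frames, num_slots=8):
    """Pool variable-length encoder frames (L, d) into a fixed number of
    slots, so decode-time attention cost is independent of input length L.
    (Illustrative mean-pooling; the paper's multi-rate encoder differs.)"""
    L, d = frames.shape
    chunks = np.array_split(np.arange(L), num_slots)
    return np.stack([frames[c].mean(axis=0) for c in chunks])  # (num_slots, d)

def streaming_attend(query, slots):
    """One streaming decode step: softmax attention over the fixed-size
    slots. Cost is O(num_slots), constant w.r.t. the original length L."""
    scores = slots @ query                  # (num_slots,)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ slots                        # context vector, shape (d,)

# Per-step work stays the same whether the input has 50 or 500 frames.
rng = np.random.default_rng(0)
for L in (50, 500):
    slots = compact_encode(rng.normal(size=(L, 16)))
    ctx = streaming_attend(rng.normal(size=16), slots)
    print(slots.shape, ctx.shape)           # (8, 16) (16,) in both cases
```

Because the decoder only ever sees the eight slots, each generated frame costs the same regardless of utterance length, which is what keeps latency and RTF flat.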