This paper presents an end-to-end text-to-speech system with low latency on a CPU, suitable for real-time applications. The system is composed of an autoregressive attention-based sequence-to-sequence acoustic model and the LPCNet vocoder for waveform generation. An acoustic model architecture that adopts modules from both the Tacotron 1 and 2 models is proposed, while stability is ensured by using a recently proposed purely location-based attention mechanism, suitable for generating sentences of arbitrary length. During inference, the decoder is unrolled and acoustic feature generation is performed in a streaming manner, allowing for a nearly constant latency that is independent of sentence length. Experimental results show that the acoustic model can produce feature sequences with minimal latency, about 31 times faster than real time on a computer CPU and 6.5 times faster than real time on a mobile CPU, enabling it to meet the conditions required for real-time applications on both devices. The full end-to-end system can generate speech of almost natural quality, as verified by listening tests.
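To make the streaming claim concrete, the following is a minimal sketch (not the authors' implementation) of how an unrolled autoregressive decoder can yield acoustic frames one at a time, so the vocoder can start synthesizing as soon as the first frame is ready and latency stays independent of utterance length. The `decoder_step` function, its state layout, and the frame dimensionality are illustrative placeholders, not the paper's actual model.

```python
# Minimal sketch of streaming acoustic-feature generation.
# All model internals here are dummy stand-ins for a trained decoder.

import numpy as np

FRAME_DIM = 20     # placeholder feature size (e.g. cepstra + pitch, LPCNet-style)
MAX_FRAMES = 1000  # hard safety limit on decoding length


def decoder_step(prev_frame, state):
    """Placeholder autoregressive decoder step: one acoustic frame per call."""
    frame = np.tanh(state["w"] @ prev_frame)  # dummy computation, not a real model
    state["t"] += 1
    stop = state["t"] >= state["n_frames"]    # dummy stop criterion
    return frame, stop, state


def stream_features(encoder_outputs, n_frames=200):
    """Unrolled decoding: yield each frame as soon as it is produced,
    so downstream vocoding can begin before the sentence is finished."""
    rng = np.random.default_rng(0)
    state = {
        "w": rng.standard_normal((FRAME_DIM, FRAME_DIM)) * 0.1,
        "t": 0,
        "n_frames": n_frames,
    }
    frame = np.zeros(FRAME_DIM)  # "go" frame that starts autoregression
    for _ in range(MAX_FRAMES):
        frame, stop, state = decoder_step(frame, state)
        yield frame              # streamed immediately, not buffered
        if stop:
            break


# Consumer loop: a vocoder would pull frames here; latency to first audio
# is roughly one decoder step, regardless of sentence length.
for i, f in enumerate(stream_features(encoder_outputs=None)):
    if i == 0:
        print("first frame ready, vocoding can begin")
```

The key design point illustrated is that generation is expressed as a generator rather than a batch call: nothing waits for the full feature sequence, which is what makes the latency nearly constant.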