Recurrent neural networks (RNNs) have become the standard technique for modeling sequence data and are used in a number of recent text-to-speech (TTS) models. However, training a TTS model that includes RNN components demands substantial GPU resources and takes a long time. In contrast, studies have shown that CNN-based sequence synthesis can greatly reduce the training time of text-to-speech models while maintaining comparable performance, owing to its high parallelism. We propose a new text-to-speech system based on deep convolutional neural networks that does not employ any RNN components (recurrent units). At the same time, we improve the generality and robustness of our model through a series of data augmentation methods, namely time warping, frequency masking, and time masking. The experimental results show that a TTS model using only CNN components can reduce training time compared with classic TTS models such as Tacotron while preserving the quality of the synthesized speech.
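The abstract does not give implementation details for the augmentation methods it names. Frequency and time masking are commonly applied by zeroing out random bands of a mel spectrogram (in the style of SpecAugment); a minimal sketch might look like the following, where the mask-width parameters `F` and `T` and the 80-bin mel input are illustrative assumptions, not values from the paper:

```python
import numpy as np

def freq_mask(spec, F=8, rng=None):
    """Zero out a random band of up to F consecutive frequency bins.

    spec: 2-D array of shape (n_freq_bins, n_time_frames).
    F: maximum mask width in bins (illustrative default, not from the paper).
    """
    if rng is None:
        rng = np.random.default_rng()
    spec = spec.copy()
    n_freq = spec.shape[0]
    f = rng.integers(0, F + 1)             # mask width, 0..F
    f0 = rng.integers(0, n_freq - f + 1)   # mask start bin
    spec[f0:f0 + f, :] = 0.0
    return spec

def time_mask(spec, T=10, rng=None):
    """Zero out a random span of up to T consecutive time frames."""
    if rng is None:
        rng = np.random.default_rng()
    spec = spec.copy()
    n_time = spec.shape[1]
    t = rng.integers(0, T + 1)             # mask width, 0..T
    t0 = rng.integers(0, n_time - t + 1)   # mask start frame
    spec[:, t0:t0 + t] = 0.0
    return spec

# Example: an 80-bin mel spectrogram with 120 frames (hypothetical shapes)
mel = np.random.rand(80, 120)
aug = time_mask(freq_mask(mel))
```

Time warping additionally displaces the spectrogram along the time axis via a sparse image warp; it is omitted here because it depends on an interpolation backend not specified in the abstract.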