The mainstream neural text-to-speech (TTS) pipeline is a cascade system comprising an acoustic model (AM), which predicts acoustic features from the input transcript, and a vocoder, which generates the waveform from those acoustic features. However, the acoustic feature in current TTS systems is typically the mel-spectrogram, which is highly correlated along both the time and frequency axes in a complicated way, making it difficult for the AM to predict. Although recent neural vocoders can generate high-fidelity audio from the ground-truth (GT) mel-spectrogram, the gap between the GT mel-spectrogram and the one predicted by the AM degrades the performance of the entire TTS system. In this work, we propose VQTTS, consisting of an AM txt2vec and a vocoder vec2wav, which uses self-supervised vector-quantized (VQ) acoustic features rather than the mel-spectrogram, and we redesign both the AM and the vocoder accordingly. In particular, txt2vec becomes essentially a classification model rather than a traditional regression model, while vec2wav uses an additional feature encoder before the HifiGAN generator to smooth the discontinuous quantized features. Our experiments show that vec2wav achieves better reconstruction performance than HifiGAN when using self-supervised VQ acoustic features. Moreover, the entire VQTTS system achieves state-of-the-art naturalness among all current publicly available TTS systems.
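To make the architectural contrast concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the codebook size, hidden width, and module names (Txt2VecHead, FeatureEncoder) are illustrative assumptions not given in the abstract. It shows txt2vec ending in a classification head over discrete VQ code indices trained with cross-entropy (instead of regressing continuous mel frames with an L1/L2 loss), and vec2wav inserting a feature encoder, here a plain convolutional stack, that smooths the discontinuous quantized features before they reach a HifiGAN-style generator.

```python
import torch
import torch.nn as nn

N_CODES = 512   # assumed VQ codebook size (not specified in the abstract)
HIDDEN = 256    # assumed hidden width

class Txt2VecHead(nn.Module):
    """txt2vec as a classification model: predict discrete VQ code indices
    with cross-entropy, rather than regressing continuous mel frames."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(HIDDEN, N_CODES)

    def forward(self, decoder_states, target_codes):
        # decoder_states: (B, T, HIDDEN); target_codes: (B, T) integer indices
        logits = self.proj(decoder_states)                 # (B, T, N_CODES)
        return nn.functional.cross_entropy(
            logits.transpose(1, 2), target_codes)          # classification loss

class FeatureEncoder(nn.Module):
    """Stand-in for the feature encoder in vec2wav: a small conv stack that
    smooths the discontinuous quantized features before the generator."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(HIDDEN, HIDDEN, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(HIDDEN, HIDDEN, kernel_size=5, padding=2),
        )

    def forward(self, vq_features):
        # vq_features: (B, HIDDEN, T) embedded VQ codes
        return self.net(vq_features)  # smoothed features for the generator
```

The design choice the sketch illustrates is the one the abstract states: discrete targets turn AM training into classification, and the vocoder compensates for the resulting discontinuity with an extra encoder rather than consuming the quantized features directly.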