Text to speech (TTS) has made rapid progress in both academia and industry in recent years. Some questions naturally arise: whether a TTS system can achieve human-level quality, how to define/judge human-level quality, and how to achieve it. In this paper, we answer these questions by first defining human-level quality based on the statistical significance of a subjective measure and introducing appropriate guidelines to judge it, and then developing a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset. Specifically, we leverage a variational autoencoder (VAE) for end-to-end text-to-waveform generation, with several key modules to enhance the capacity of the prior from text and reduce the complexity of the posterior from speech, including phoneme pre-training, differentiable duration modeling, bidirectional prior/posterior modeling, and a memory mechanism in the VAE. Experimental evaluations on the popular LJSpeech dataset show that our proposed NaturalSpeech achieves -0.01 CMOS (comparative mean opinion score) to human recordings at the sentence level, with a Wilcoxon signed rank test at p-level p >> 0.05, which demonstrates no statistically significant difference from human recordings for the first time on this dataset.
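To make the core idea concrete, below is a minimal, hypothetical sketch (not the authors' implementation) of the VAE objective described above: a Gaussian prior predicted from text is matched via KL divergence to a posterior inferred from speech, and a decoder reconstructs the waveform from the sampled latent. All module names, dimensions, and the assumption that phonemes and speech frames are already length-aligned are illustrative; the paper handles alignment with differentiable duration modeling and further narrows the prior/posterior gap with bidirectional flow-based modeling, phoneme pre-training, and a memory mechanism, none of which are shown here.

```python
# Simplified sketch of a text-conditioned VAE for waveform generation.
# Assumptions (not from the paper): latent size 192, 80-dim speech frames,
# hop size 256, and phonemes pre-aligned to frames.
import torch
import torch.nn as nn

LATENT = 192

class TextPrior(nn.Module):
    """Predicts a diagonal-Gaussian prior p(z | text) per phoneme position."""
    def __init__(self, n_phonemes=100):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, LATENT)
        self.proj = nn.Linear(LATENT, 2 * LATENT)   # -> mean, log-variance

    def forward(self, phoneme_ids):
        mean, logvar = self.proj(self.embed(phoneme_ids)).chunk(2, dim=-1)
        return mean, logvar

class SpeechPosterior(nn.Module):
    """Infers a diagonal-Gaussian posterior q(z | speech) per frame."""
    def __init__(self, frame_dim=80):
        super().__init__()
        self.proj = nn.Linear(frame_dim, 2 * LATENT)

    def forward(self, frames):
        mean, logvar = self.proj(frames).chunk(2, dim=-1)
        return mean, logvar

class WaveDecoder(nn.Module):
    """Maps latent frames back to waveform samples (toy upsampling decoder)."""
    def __init__(self, hop=256):
        super().__init__()
        self.proj = nn.Linear(LATENT, hop)

    def forward(self, z):
        return self.proj(z).flatten(1)              # (batch, frames * hop)

def kl_gaussians(mq, lvq, mp, lvp):
    """KL(q || p) between diagonal Gaussians, averaged over the batch."""
    return 0.5 * ((lvp - lvq) + (lvq.exp() + (mq - mp) ** 2) / lvp.exp() - 1.0).sum(-1).mean()

# Toy training step on random tensors.
B, T = 2, 8
phonemes = torch.randint(0, 100, (B, T))
frames = torch.randn(B, T, 80)
waveform = torch.randn(B, T * 256)

prior, posterior, decoder = TextPrior(), SpeechPosterior(), WaveDecoder()
mp, lvp = prior(phonemes)
mq, lvq = posterior(frames)
z = mq + torch.randn_like(mq) * (0.5 * lvq).exp()   # reparameterization trick
recon = decoder(z)
loss = nn.functional.l1_loss(recon, waveform) + kl_gaussians(mq, lvq, mp, lvp)
loss.backward()
```

At inference time only the text side is needed: z is sampled from the text-predicted prior and passed through the decoder, which is why strengthening the prior and simplifying the posterior directly reduces the training/inference mismatch.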