Modern Text-to-Speech (TTS) systems produce speech that is rated highly in subjective evaluations, yet the distance between the real and synthetic speech distributions remains understudied. Here, \textit{distribution} means the sample space of all possible real speech recordings from a given set of speakers, or of all synthetic samples that could be generated for the same set of speakers. We evaluate the distance between real and synthetic speech distributions along three dimensions (acoustic environment, speaker characteristics, and prosody), using a range of speech processing measures and the Wasserstein distances between their respective distributions. We reduce these distances by providing the model with utterance-level information derived from the measures, and show that this information can be generated at inference time. The per-dimension improvements translate into a reduction of the overall distribution distance, which we approximate using Automatic Speech Recognition (ASR) by evaluating how well the synthetic data serves as training data.
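As an illustration of comparing a single measure across real and synthetic speech, the following minimal sketch computes the empirical 1-Wasserstein distance between two one-dimensional samples. It is not the authors' implementation: the function name `wasserstein_1d`, the choice of measure (a per-utterance mean pitch in Hz), and the toy values are all assumptions made for the example.

```python
from bisect import bisect_right


def wasserstein_1d(xs, ys):
    """Empirical 1-Wasserstein distance between two 1-D samples.

    Computed as the integral of |F_x(t) - F_y(t)| over t, where F_x and
    F_y are the empirical CDFs of the two samples.
    """
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    xs, ys = sorted(xs), sorted(ys)
    # The empirical CDFs only change at observed values, so it is enough
    # to integrate piecewise between consecutive distinct points.
    points = sorted(set(xs) | set(ys))
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_x = bisect_right(xs, left) / len(xs)
        cdf_y = bisect_right(ys, left) / len(ys)
        total += abs(cdf_x - cdf_y) * (right - left)
    return total


# Hypothetical per-utterance mean pitch values in Hz (toy numbers chosen
# for illustration, not results from the paper).
real_pitch = [110.0, 125.0, 140.0, 180.0, 205.0]
synthetic_pitch = [120.0, 128.0, 135.0, 150.0, 160.0]

distance = wasserstein_1d(real_pitch, synthetic_pitch)
print(f"Wasserstein distance for the pitch measure: {distance:.2f} Hz")
```

A smaller value means the synthetic values of that measure are distributed more like the real ones. Under this reading, the same computation would be repeated per measure to cover each dimension.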