In expressive speech synthesis, latent prosody representations are widely used to handle the variability of the data during training. The same text may correspond to various acoustic realizations, which is known as the one-to-many mapping problem in text-to-speech. Utterance-, word-, or phoneme-level representations are extracted from the target signal in an auto-encoding setup to complement the phonetic input and simplify that mapping. This paper compares prosodic embeddings at different levels of granularity and examines their prediction from text. We show that utterance-level embeddings have insufficient capacity, while phoneme-level embeddings tend to introduce instabilities when predicted from text. Word-level representations strike a balance between capacity and predictability. As a result, we close the gap in naturalness between synthetic speech and recordings on the LibriTTS dataset by 90%, without sacrificing intelligibility.
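The word-level granularity the abstract advocates can be illustrated with a minimal sketch: frame-level acoustic features are pooled over known word alignments to yield one fixed-size prosody vector per word, as a reference encoder might do in an auto-encoding setup. All names here (`pool_word_embeddings`, the toy shapes) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def pool_word_embeddings(frames: np.ndarray, word_spans: list) -> np.ndarray:
    """Average frame-level features over each word's [start, end) frame span.

    frames:     (num_frames, feat_dim) acoustic features, e.g. mel frames.
    word_spans: list of (start, end) frame indices, one per word.
    Returns:    (num_words, feat_dim) word-level embeddings.
    """
    return np.stack([frames[s:e].mean(axis=0) for s, e in word_spans])

# Toy example: 10 frames of 4-dim features aligned to two words.
frames = np.arange(40, dtype=float).reshape(10, 4)
spans = [(0, 6), (6, 10)]
emb = pool_word_embeddings(frames, spans)
print(emb.shape)  # one embedding per word: (2, 4)
```

In a real system these pooled vectors would be passed through a learned encoder and, at inference time, predicted from text instead of extracted from the reference signal.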