This paper presents our work on phrase break prediction in the context of end-to-end TTS systems, motivated by the following questions: (i) Is there any utility in incorporating an explicit phrasing model in an end-to-end TTS system? and (ii) How do you evaluate the effectiveness of a phrasing model in an end-to-end TTS system? In particular, the utility and effectiveness of phrase break prediction models are evaluated in the context of children's story synthesis, using listener comprehension. We show by means of perceptual listening evaluations that there is a clear preference for stories synthesized after predicting the location of phrase breaks using a trained phrasing model, over stories directly synthesized without predicting the location of phrase breaks.