This paper presents our work on phrase break prediction in the context of end-to-end TTS systems, motivated by the following questions: (i) Is there any utility in incorporating an explicit phrasing model in an end-to-end TTS system? and (ii) How do you evaluate the effectiveness of a phrasing model in an end-to-end TTS system? In particular, the utility and effectiveness of phrase break prediction models are evaluated in the context of children's story synthesis, using listener comprehension. We show by means of perceptual listening evaluations that there is a clear preference for stories synthesized after predicting the location of phrase breaks using a trained phrasing model, over stories directly synthesized without predicting the location of phrase breaks.