The prosody of a spoken word is determined by its surrounding context. In incremental text-to-speech synthesis, where the synthesizer produces output before it has access to the complete input, the full context is often unknown, which can result in a loss of naturalness in the synthesized speech. In this paper, we investigate whether the use of predicted future text can attenuate this loss. We compare four test conditions for the next word: (a) unknown (zero-word lookahead), (b) predicted by a language model, (c) randomly chosen, and (d) ground truth. We measure prosodic features (pitch, energy, and duration) and find that predicted text provides significant improvements over a zero-word lookahead, but only slight gains over a random-word lookahead. We confirm these results with a perceptual test.
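The four lookahead conditions can be illustrated with a minimal sketch. This is not the paper's implementation: the function `build_contexts`, the condition names, and the bigram table standing in for a language model are all hypothetical, chosen only to show what text an incremental synthesizer would see at each word under each condition.

```python
import random
from typing import Callable, List, Optional, Sequence

# Hypothetical stand-in for a language model: a tiny bigram lookup table.
BIGRAMS = {"the": "dog", "cat": "sat"}


def toy_predict(prefix: Sequence[str]) -> Optional[str]:
    """Return a guessed next word for the prefix, or None if there is no guess."""
    return BIGRAMS.get(prefix[-1])


def build_contexts(
    words: Sequence[str],
    condition: str,
    predict_next: Optional[Callable[[Sequence[str]], Optional[str]]] = None,
    vocab: Optional[Sequence[str]] = None,
    rng: Optional[random.Random] = None,
) -> List[List[str]]:
    """For each word position, return the text visible to the synthesizer:
    the words so far plus at most one lookahead word chosen by `condition`."""
    contexts = []
    for i in range(len(words)):
        prefix = list(words[: i + 1])
        if condition == "zero":
            future = None  # (a) next word unknown
        elif condition == "predicted":
            if predict_next is None:
                raise ValueError("condition 'predicted' needs predict_next")
            future = predict_next(prefix)  # (b) predicted next word
        elif condition == "random":
            if not vocab:
                raise ValueError("condition 'random' needs a non-empty vocab")
            future = (rng or random).choice(vocab)  # (c) random next word
        elif condition == "ground_truth":
            # (d) the true next word, if the sentence has one
            future = words[i + 1] if i + 1 < len(words) else None
        else:
            raise ValueError(f"unknown condition: {condition}")
        contexts.append(prefix + ([future] if future is not None else []))
    return contexts


sentence = ["the", "cat", "sat"]
for name in ("zero", "predicted", "ground_truth"):
    print(name, build_contexts(sentence, name, predict_next=toy_predict))
```

In this sketch the synthesizer would generate prosody for the last real word of each context, with the appended lookahead word serving only as conditioning text.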