Recent advances in text-to-speech have significantly improved the expressiveness of synthesized speech. However, it remains challenging to generate speech with a contextually appropriate and coherent speaking style for multi-sentence text in audiobooks. In this paper, we propose a context-aware coherent speaking style prediction method for audiobook speech synthesis. To predict the style embedding of the current utterance, we design a hierarchical transformer-based context-aware style predictor with a mixture attention mask, which considers both text-side context information and speech-side style information from preceding utterances. Based on this predictor, we can generate long-form speech with coherent style and prosody sentence by sentence. Objective and subjective evaluations on a Mandarin audiobook dataset demonstrate that our proposed model generates speech with a more expressive and coherent speaking style than the baselines, in both single-sentence and multi-sentence tests.
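As a purely illustrative sketch of how such a predictor could be organized, the following PyTorch module concatenates sentence-level text-context embeddings with the style embeddings of previous utterances and encodes them with a transformer. All module names, dimensions, and the exact form of the mixture attention mask are our assumptions for illustration; the abstract does not specify them, and the paper's actual architecture may differ. Here the assumed mask lets every position attend to all text positions while keeping attention among past-style positions causal.

```python
import torch
import torch.nn as nn


class ContextAwareStylePredictor(nn.Module):
    """Hypothetical sketch: predicts the current utterance's style embedding
    from text-side context embeddings and the style embeddings of previously
    synthesized utterances."""

    def __init__(self, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.proj = nn.Linear(d_model, d_model)

    @staticmethod
    def mixture_attention_mask(n_text, n_style, device):
        # Assumed pattern (True = blocked): every position may attend to all
        # text positions, while attention among past-style positions is
        # causal, so each utterance's style depends only on earlier styles.
        total = n_text + n_style
        mask = torch.zeros(total, total, dtype=torch.bool, device=device)
        mask[n_text:, n_text:] = torch.triu(
            torch.ones(n_style, n_style, dtype=torch.bool, device=device),
            diagonal=1)
        return mask

    def forward(self, text_ctx, past_styles):
        # text_ctx:    (B, n_text, d)  sentence-level text context embeddings
        # past_styles: (B, n_style, d) style embeddings of previous utterances
        x = torch.cat([text_ctx, past_styles], dim=1)
        mask = self.mixture_attention_mask(
            text_ctx.size(1), past_styles.size(1), x.device)
        h = self.encoder(x, mask=mask)
        # Assumption: the current sentence is the last text position.
        return self.proj(h[:, text_ctx.size(1) - 1])


# Usage: predict a style vector for the current sentence from 5 sentences of
# text context and 4 previous utterances' styles (dimensions are arbitrary).
predictor = ContextAwareStylePredictor()
style = predictor(torch.randn(2, 5, 256), torch.randn(2, 4, 256))
print(style.shape)  # torch.Size([2, 256])
```

The predicted style vector would then condition the acoustic model for the current sentence, and the resulting utterance's style embedding would be appended to the past-style sequence before synthesizing the next sentence, yielding sentence-by-sentence generation with coherent style.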