Prosodic boundaries play an important role in the naturalness and readability of text-to-speech (TTS) synthesis. However, obtaining prosodic boundary labels relies on manual annotation, which is costly and time-consuming. In this paper, we propose to automatically extract prosodic boundary labels from text-audio data via a neural text-speech model with pre-trained audio encoders. The model is pre-trained on text and speech data separately and then jointly fine-tuned on TTS data in a triplet format: {speech, text, prosody}. Results from both automatic and human evaluation demonstrate that: 1) the proposed text-speech prosody annotation framework significantly outperforms text-only baselines; 2) the quality of the automatic prosodic boundary annotations is comparable to that of human annotations; 3) TTS systems trained with model-annotated boundaries are slightly better than systems trained with manually annotated ones.