This paper presents a speech BERT model that extracts the prosody information embedded in speech segments to improve the prosody of synthesized speech in neural text-to-speech (TTS). As a pre-trained model, it can learn prosody attributes from a large amount of speech data, thus exploiting far more data than the training set of the target TTS. In the proposed BERT, the embedding is extracted from the immediately preceding speech segment of a fixed length. The extracted embedding is then used together with the mel-spectrogram to predict the following segment in the TTS decoder. Experimental results with a Transformer TTS show that the proposed BERT can extract fine-grained, segment-level prosody, which is complementary to utterance-level prosody and further improves the prosody of the synthesized speech. On a single-speaker TTS, the objective distortions between the generated speech and the original recordings are reduced. Subjective listening tests also show that the proposed approach is preferred over the TTS without the BERT prosody embedding module, for both in-domain and out-of-domain applications. The subjective preference is similarly confirmed for Microsoft professional single- and multi-speaker voices and for the LJ speaker in the public database. TTS demo audio samples are available at https://judy44chen.github.io/TTSSpeechBERT/.
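To make the conditioning mechanism concrete, below is a minimal sketch (not the authors' code) of how a segment-level prosody embedding could be fused with mel-spectrogram frames before they enter an autoregressive TTS decoder. The module name, dimensions, and concatenation scheme are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class ProsodyConditionedPrenet(nn.Module):
    """Fuses a fixed-length-segment prosody embedding with mel frames
    before they enter an autoregressive TTS decoder (illustrative only)."""

    def __init__(self, n_mels: int = 80, prosody_dim: int = 256, out_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(n_mels + prosody_dim, out_dim)

    def forward(self, mel_frames: torch.Tensor, prosody_emb: torch.Tensor) -> torch.Tensor:
        # mel_frames:  (batch, frames, n_mels) -- decoder input frames
        # prosody_emb: (batch, prosody_dim)    -- embedding extracted by the
        #              speech BERT from the previous fixed-length segment
        expanded = prosody_emb.unsqueeze(1).expand(-1, mel_frames.size(1), -1)
        return torch.relu(self.proj(torch.cat([mel_frames, expanded], dim=-1)))

# Usage: condition prediction of the next segment on the previous segment.
prenet = ProsodyConditionedPrenet()
mel = torch.randn(2, 40, 80)          # frames of the segment being decoded
prosody = torch.randn(2, 256)         # speech-BERT embedding of the prior segment
decoder_input = prenet(mel, prosody)  # (2, 40, 256)
```

Broadcasting the segment-level embedding across all frames of the following segment is one simple way to realize "used together with the mel-spectrogram" in the decoder; other fusion schemes (e.g., attention over segment embeddings) are equally plausible.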