End-to-end speech synthesis models can directly take an utterance as reference audio and generate speech from text with prosody and speaker characteristics similar to the reference. However, an appropriate acoustic embedding must be manually selected during inference, and because only matched text and speech are used during training, inference with unmatched text and speech causes the model to synthesize speech with low content quality. In this study, we propose to mitigate these two problems by using multiple reference audios and a style embedding constraint rather than only the target audio. The multiple reference audios are selected automatically according to sentence similarity computed with Bidirectional Encoder Representations from Transformers (BERT). In addition, we use the "target" style embedding from a pre-trained encoder as a constraint by considering the mutual information between the predicted and "target" style embeddings. Experimental results show that the proposed model improves speech naturalness and content quality with multiple reference audios, and also outperforms the baseline model in ABX preference tests of style similarity.
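To make the reference-selection step concrete, below is a minimal sketch (not the authors' code) of choosing multiple reference audios by BERT sentence similarity. The model name `bert-base-uncased`, mean pooling, cosine similarity, the `(text, audio_path)` corpus structure, and `k=3` are all illustrative assumptions; the paper does not specify these details.

```python
# Sketch: select k reference audios whose transcripts are most similar
# (by BERT sentence embedding) to the text being synthesized.
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed checkpoint; the paper does not name the exact BERT variant.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()

def embed(sentences):
    """Mean-pooled BERT embeddings for a list of sentences."""
    batch = tokenizer(sentences, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**batch).last_hidden_state      # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)      # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)       # (B, H)

def select_references(target_text, corpus, k=3):
    """Return the k (text, audio_path) pairs most similar to target_text.

    corpus: list of (transcript, audio_path) tuples -- an assumed layout.
    """
    texts = [text for text, _ in corpus]
    emb = embed([target_text] + texts)
    # Cosine similarity between the target (row 0) and every candidate.
    sims = torch.nn.functional.cosine_similarity(emb[:1], emb[1:])
    top = sims.topk(min(k, len(corpus))).indices.tolist()
    return [corpus[i] for i in top]

# Usage: refs = select_references("How are you today?", corpus, k=3)
```

The selected references would then condition the synthesis model in place of a single manually chosen utterance, which is the selection problem the abstract describes.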