Accent plays a significant role in speech communication: it affects how well speech is understood and conveys a speaker's identity. This paper introduces a novel and efficient framework for accented Text-to-Speech (TTS) synthesis based on a Conditional Variational Autoencoder (CVAE). The framework can synthesize speech in a selected speaker's voice converted to any desired target accent. We validate the proposed framework through thorough experiments using both objective and subjective evaluations. The results show remarkable performance in manipulating the accent of the synthesized speech and point to a promising avenue for future accented TTS research.