Accented text-to-speech (TTS) synthesis seeks to generate speech with an accent (L2) as a variant of the standard version (L1). Controlling the intensity of the accent during TTS is an active research direction that has attracted growing attention. Recent work designs a speaker-adversarial loss to disentangle speaker and accent information, and then adjusts the loss weight to control the accent intensity. However, this control method lacks interpretability, and there is no direct correlation between the controlling factor and natural accent intensity. To this end, this paper proposes a new intuitive and explicit accent intensity control scheme for accented TTS. Specifically, we first extract the posterior probability, called the ``goodness of pronunciation (GoP)'', from an L1 speech recognition model to quantify the phoneme-level accent intensity of accented speech. We then design a FastSpeech2-based TTS model, named Ai-TTS, that takes the accent intensity expression into account during speech generation. Experiments show that our method outperforms the baseline model in terms of accent rendering and intensity control.
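To make the GoP idea concrete, the following is a minimal sketch of one common posterior-based formulation: the average log posterior of the canonical phoneme over its aligned frames. It is an illustration under stated assumptions, not necessarily the exact formula used in this paper. The function name `gop_scores`, the `(phone_id, start, end)` segment layout, and the toy posterior matrix are all hypothetical; in practice the posteriors would come from an L1 acoustic model and the segments from a forced alignment.

```python
import numpy as np

def gop_scores(posteriors, segments, eps=1e-10):
    """Average log posterior of the canonical phoneme over each aligned segment.

    posteriors: array of shape (num_frames, num_phonemes), rows sum to 1
                (assumed to come from an L1 speech recognition model).
    segments:   list of (phone_id, start_frame, end_frame), end exclusive
                (assumed to come from a forced alignment).
    """
    scores = []
    for phone_id, start, end in segments:
        # Posterior of the expected phoneme on each frame of its segment.
        frame_probs = posteriors[start:end, phone_id]
        # eps guards against log(0) for frames with zero posterior.
        scores.append(float(np.mean(np.log(frame_probs + eps))))
    return scores

# Toy data: 4 frames, 3 phoneme classes (values are made up).
posteriors = np.array([
    [0.9, 0.05, 0.05],
    [0.8, 0.10, 0.10],
    [0.2, 0.50, 0.30],
    [0.1, 0.30, 0.60],
])
# Phoneme 0 spans frames 0-1, phoneme 1 spans frames 2-3.
segments = [(0, 0, 2), (1, 2, 4)]
scores = gop_scores(posteriors, segments)
print(scores)
```

Under this formulation a score closer to zero means the L1 model is confident in the canonical phoneme, while a more negative score suggests a pronunciation further from L1, which is one plausible reading of higher accent intensity for that phoneme.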