With the rapid development of speech synthesis systems, recent text-to-speech models can generate natural speech that closely resembles human speech. However, they remain limited in expressiveness. In particular, existing emotional speech synthesis models offer controllability by interpolating features with scaling parameters in an emotional latent space. The latent space these models produce, however, makes it difficult to control emotional intensity continuously, because features such as emotion and speaker identity are entangled. In this paper, we propose a novel method for controlling the continuous intensity of emotions using semi-supervised learning. The model learns emotions of intermediate intensity from pseudo-labels generated from phoneme-level sequences of speech information. The embedding space built by the proposed model satisfies a uniform grid geometry with an emotional basis. In addition, to improve the naturalness of speech with intermediate emotional intensity, a discriminator is applied to the generation of low-level elements such as duration, pitch, and energy. Experimental results show that the proposed method is superior in both controllability and naturalness. The synthesized speech samples are available at https://tinyurl.com/34zaehh2
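The interpolation-based intensity control that the abstract attributes to existing models can be illustrated with a minimal sketch. This is not the proposed method, which learns intermediate intensities from pseudo-labels. The function name `interpolate_emotion`, the three-dimensional toy embeddings, and the restriction of the scaling parameter to [0, 1] are all assumptions made for illustration.

```python
from typing import List


def interpolate_emotion(
    neutral: List[float], emotion: List[float], intensity: float
) -> List[float]:
    """Linearly interpolate between a neutral and an emotional embedding.

    `intensity` is a hypothetical scaling parameter: 0.0 returns the neutral
    embedding and 1.0 returns the full-intensity emotional embedding.
    """
    if len(neutral) != len(emotion):
        raise ValueError("embeddings must have the same dimension")
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must lie in [0, 1]")
    return [n + intensity * (e - n) for n, e in zip(neutral, emotion)]


# Toy embeddings (assumed values, not taken from the paper).
neutral = [0.0, 0.0, 0.0]
happy = [1.0, 2.0, -1.0]

# Five evenly spaced intensity levels between neutral and fully "happy".
grid = [interpolate_emotion(neutral, happy, k / 4) for k in range(5)]
```

In a real model, the points between the two endpoints are meaningful only if the latent space is disentangled. The abstract says that emotion and speaker features are entangled in existing models, which is why plain interpolation gives poor intensity control.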