Recently, emotional speech synthesis has achieved remarkable performance. The emotion strength of synthesized speech can be controlled flexibly using a strength descriptor, which is obtained by an emotion attribute ranking function. However, a ranking function trained on specific data generalizes poorly, which limits its applicability to more realistic cases. In this paper, we propose a deep learning based emotion strength assessment network for strength prediction, referred to as StrengthNet. Our model conforms to a multi-task learning framework, with a structure that includes an acoustic encoder, a strength predictor and an auxiliary emotion predictor. A data augmentation strategy is utilized to improve model generalization. Experiments show that the emotion strength predicted by the proposed StrengthNet is highly correlated with ground truth scores for both seen and unseen speech. Our code is available at: https://github.com/ttslr/StrengthNet.
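To make the multi-task structure concrete, the following is a minimal sketch of a shared acoustic encoder feeding a strength regression head and an auxiliary emotion classification head. All layer choices and sizes (GRU encoder, hidden width, number of emotion classes) are illustrative assumptions, not the authors' implementation; see the linked repository for the actual StrengthNet model.

```python
# Minimal sketch of a multi-task strength/emotion model.
# Layer types and sizes are hypothetical, not the authors' StrengthNet.
import torch
import torch.nn as nn

class StrengthNetSketch(nn.Module):
    def __init__(self, n_mels: int = 80, hidden: int = 128, n_emotions: int = 5):
        super().__init__()
        # Shared acoustic encoder over mel-spectrogram frames.
        self.encoder = nn.GRU(input_size=n_mels, hidden_size=hidden,
                              batch_first=True, bidirectional=True)
        # Strength predictor: frame-level scores averaged to an utterance score.
        self.strength_head = nn.Sequential(
            nn.Linear(2 * hidden, 64), nn.ReLU(), nn.Linear(64, 1))
        # Auxiliary emotion classifier, trained jointly as a multi-task objective.
        self.emotion_head = nn.Sequential(
            nn.Linear(2 * hidden, 64), nn.ReLU(), nn.Linear(64, n_emotions))

    def forward(self, mel: torch.Tensor):
        # mel: (batch, frames, n_mels)
        feats, _ = self.encoder(mel)                      # (batch, frames, 2*hidden)
        strength = self.strength_head(feats).mean(dim=1)  # utterance-level strength
        emotion_logits = self.emotion_head(feats.mean(dim=1))
        return strength.squeeze(-1), emotion_logits
```

In this kind of setup, the total loss would typically combine a regression loss on the strength score with a classification loss on the emotion labels, so the auxiliary task regularizes the shared encoder.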