In expressive speech synthesis, there are high requirements for emotion interpretation. However, it is time-consuming to acquire an emotional audio corpus for arbitrary speakers, since doing so demands strong emotional acting ability from the speakers. To address this problem, this paper proposes a cross-speaker emotion transfer method that transfers emotions from a source speaker to a target speaker. A set of emotion tokens is first defined to represent the various emotion categories; through a cross-entropy loss and a semi-supervised training strategy, each token is trained to be highly correlated with its corresponding emotion, enabling controllable synthesis. Meanwhile, to eliminate the degradation of timbre similarity caused by cross-speaker emotion transfer, speaker condition layer normalization is introduced to model speaker characteristics. Experimental results show that the proposed method outperforms a multi-reference baseline in terms of timbre similarity, stability, and emotion perception evaluations.
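The abstract does not give implementation details, but the emotion-token mechanism can be read as a learnable token bank whose attention weights over the tokens also serve as classification logits, so a cross-entropy loss on labeled samples ties each token to one emotion category while unlabeled samples contribute no classification loss. The following PyTorch sketch is illustrative only; the class names, dimensions, single-projection attention, and the `-1` convention for unlabeled samples are assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionTokenLayer(nn.Module):
    """A bank of learnable emotion tokens (one per emotion category).
    A reference embedding attends over the tokens; the attention logits
    double as classifier outputs, so cross-entropy can bind each token
    to its emotion. Shapes are illustrative assumptions."""

    def __init__(self, num_emotions: int, token_dim: int, ref_dim: int):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_emotions, token_dim))
        self.query_proj = nn.Linear(ref_dim, token_dim)

    def forward(self, ref_emb: torch.Tensor):
        # ref_emb: (batch, ref_dim) from a reference/emotion encoder.
        query = self.query_proj(ref_emb)          # (batch, token_dim)
        logits = query @ self.tokens.t()          # (batch, num_emotions)
        weights = logits.softmax(dim=-1)
        emotion_emb = weights @ self.tokens       # (batch, token_dim)
        return emotion_emb, logits

def semi_supervised_ce(logits: torch.Tensor, labels: torch.Tensor):
    """Cross-entropy on labeled samples only; samples marked with
    label -1 are ignored, giving a simple semi-supervised objective."""
    return F.cross_entropy(logits, labels, ignore_index=-1)
```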
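Similarly, speaker condition layer normalization can be understood as a layer norm whose per-channel scale and bias are predicted from a speaker embedding instead of being fixed learned parameters, so speaker identity modulates every normalized hidden state. The sketch below assumes this common conditional-layer-norm formulation; the paper's exact dimensions, initialization, and placement in the network are not specified here.

```python
import torch
import torch.nn as nn

class SpeakerConditionLayerNorm(nn.Module):
    """Layer normalization whose scale (gamma) and bias (beta) are
    predicted from a speaker embedding, letting speaker characteristics
    modulate the normalized hidden states. A minimal sketch of the
    SCLN idea; hyperparameters are assumptions."""

    def __init__(self, hidden_dim: int, speaker_dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.to_gamma = nn.Linear(speaker_dim, hidden_dim)
        self.to_beta = nn.Linear(speaker_dim, hidden_dim)
        # Initialize so the module starts as a plain LayerNorm
        # (gamma = 1, beta = 0) regardless of the speaker embedding.
        nn.init.zeros_(self.to_gamma.weight)
        nn.init.ones_(self.to_gamma.bias)
        nn.init.zeros_(self.to_beta.weight)
        nn.init.zeros_(self.to_beta.bias)

    def forward(self, x: torch.Tensor, speaker_emb: torch.Tensor):
        # x: (batch, time, hidden_dim); speaker_emb: (batch, speaker_dim)
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        normed = (x - mean) / torch.sqrt(var + self.eps)
        gamma = self.to_gamma(speaker_emb).unsqueeze(1)  # (batch, 1, hidden)
        beta = self.to_beta(speaker_emb).unsqueeze(1)
        return gamma * normed + beta
```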