Cross-speaker emotion transfer speech synthesis aims to synthesize emotional speech for a target speaker by transferring the emotion from reference speech recorded by another (source) speaker. In this task, extracting a speaker-independent emotion embedding from the reference speech plays an important role. However, the emotional information conveyed by such an embedding tends to be weakened in the process of squeezing out the source speaker's timbre information. To address this problem, a prosody compensation module (PCM) is proposed in this paper to compensate for the loss of emotional information. Specifically, the PCM obtains speaker-independent emotional information from the intermediate features of a pre-trained ASR model. To this end, a prosody compensation encoder with global context (GC) blocks is introduced to extract global emotional information from these intermediate features. Experiments demonstrate that the proposed PCM effectively compensates the emotion embedding for the lost emotional information while maintaining the timbre of the target speaker. Comparisons with state-of-the-art models show that the proposed method is clearly superior on the cross-speaker emotion transfer task.
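To make the GC-block component concrete, the following is a minimal PyTorch sketch of a global context block applied to frame-level features such as an ASR model's hidden states. It assumes a GCNet-style design (softmax attention pooling into a single context vector, a bottleneck transform, and additive fusion); the abstract does not specify the exact structure of the prosody compensation encoder, so the class name, tensor shapes, and reduction ratio here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class GCBlock1D(nn.Module):
    """Hypothetical global context (GC) block over frame-level features (B, C, T).

    Sketch in the style of GCNet-type GC blocks: softmax attention pooling
    collects a global context vector, a bottleneck transform re-weights it,
    and the result is added back to every frame.
    """

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = channels // reduction
        # 1x1 conv producing one attention logit per frame
        self.attn = nn.Conv1d(channels, 1, kernel_size=1)
        # bottleneck transform applied to the pooled (B, C, 1) context vector
        self.transform = nn.Sequential(
            nn.Conv1d(channels, hidden, kernel_size=1),
            nn.LayerNorm([hidden, 1]),
            nn.ReLU(inplace=True),
            nn.Conv1d(hidden, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T) frame-level features, e.g. ASR intermediate features
        weights = torch.softmax(self.attn(x), dim=-1)    # (B, 1, T) frame weights
        context = torch.bmm(x, weights.transpose(1, 2))  # (B, C, 1) global context
        return x + self.transform(context)               # broadcast fusion over T


if __name__ == "__main__":
    feats = torch.randn(2, 256, 120)  # batch of 2, 256-dim features, 120 frames
    out = GCBlock1D(256)(feats)
    print(out.shape)                  # torch.Size([2, 256, 120])
```

Because the attention pooling aggregates over the whole time axis, each frame is enriched with utterance-level (global) information, which matches the abstract's stated goal of extracting global emotional information from the ASR features.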