Emotional text-to-speech synthesis (ETTS) has seen considerable progress in recent years. However, the generated voice often cannot be perceptually identified as its intended emotion category. To address this problem, we propose a new interactive training paradigm for ETTS, denoted as i-ETTS, which seeks to directly improve emotion discriminability by interacting with a speech emotion recognition (SER) model. Moreover, we formulate an iterative training strategy with reinforcement learning to ensure the quality of i-ETTS optimization. Experimental results demonstrate that the proposed i-ETTS outperforms state-of-the-art baselines by rendering speech with a more accurate emotion style. To the best of our knowledge, this is the first study of reinforcement learning in emotional text-to-speech synthesis.
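To make the interactive training idea concrete, the sketch below illustrates one possible reading of the abstract: a frozen SER model scores how recognizable the intended emotion is in the rendered speech, and that score is used as a reward in a REINFORCE-style policy-gradient update of the TTS model. The `EmotionalTTS` and `SERClassifier` classes, the emotion label set, and the Gaussian sampling of mel frames are all illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of SER-in-the-loop reinforcement learning for emotional TTS.
# Assumptions (not from the paper): toy model architectures, 4 emotion labels,
# unit-variance Gaussian sampling of mel frames to obtain a stochastic policy.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_EMOTIONS = 4   # e.g. neutral / happy / sad / angry (assumed label set)
MEL_DIM = 80
TEXT_DIM = 128


class EmotionalTTS(nn.Module):
    """Toy stand-in for an emotion-conditioned acoustic model."""
    def __init__(self):
        super().__init__()
        self.emotion_embed = nn.Embedding(NUM_EMOTIONS, 32)
        self.decoder = nn.GRU(TEXT_DIM + 32, 256, batch_first=True)
        self.mel_out = nn.Linear(256, MEL_DIM)

    def forward(self, text_feats, emotion_id):
        emo = self.emotion_embed(emotion_id)                       # (B, 32)
        emo = emo.unsqueeze(1).expand(-1, text_feats.size(1), -1)  # (B, T, 32)
        h, _ = self.decoder(torch.cat([text_feats, emo], dim=-1))
        return self.mel_out(h)                                     # (B, T, MEL_DIM)


class SERClassifier(nn.Module):
    """Frozen speech emotion recognizer used only as a reward model."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.GRU(MEL_DIM, 128, batch_first=True)
        self.head = nn.Linear(128, NUM_EMOTIONS)

    def forward(self, mels):
        _, h = self.encoder(mels)
        return self.head(h[-1])                                    # (B, NUM_EMOTIONS) logits


def rl_step(tts, ser, optimizer, text_feats, emotion_id):
    """One REINFORCE update: reward = SER probability of the target emotion."""
    mu = tts(text_feats, emotion_id)
    dist = torch.distributions.Normal(mu, 1.0)
    mels = dist.sample()                               # stochastic rendering (no gradient)
    log_prob = dist.log_prob(mels).mean(dim=(1, 2))    # per-utterance log-probability (B,)

    with torch.no_grad():                              # reward comes from the frozen SER
        probs = F.softmax(ser(mels), dim=-1)
        reward = probs.gather(1, emotion_id.unsqueeze(1)).squeeze(1)   # (B,)

    loss = -(reward * log_prob).mean()                 # maximize expected SER reward

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), reward.mean().item()


if __name__ == "__main__":
    torch.manual_seed(0)
    tts, ser = EmotionalTTS(), SERClassifier()
    for p in ser.parameters():                         # keep the SER reward model fixed
        p.requires_grad_(False)
    opt = torch.optim.Adam(tts.parameters(), lr=1e-4)

    text = torch.randn(8, 50, TEXT_DIM)                # dummy text encodings
    emo = torch.randint(0, NUM_EMOTIONS, (8,))         # intended emotion labels
    loss, avg_reward = rl_step(tts, ser, opt, text, emo)
    print(f"loss={loss:.4f}  mean SER reward={avg_reward:.4f}")
```

In this reading, the iterative strategy mentioned in the abstract would alternate such reward-driven updates with the usual supervised reconstruction training, so that emotion discriminability improves without degrading speech quality; the exact schedule and reward definition used by i-ETTS are detailed in the paper itself.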