Emotional Voice Conversion (EVC) aims to convert the emotional style of a source speech signal to a target style while preserving its content and speaker identity. Previous emotional conversion studies do not disentangle emotional information from the emotion-independent information that should be preserved, and therefore transform the signal monolithically, producing low-quality audio with linguistic distortions. To address this distortion problem, we propose a novel StarGAN framework with a two-stage training process that separates emotional features from emotion-independent ones by using an autoencoder with two encoders as the generator of the Generative Adversarial Network (GAN). The proposed model achieves favourable results in both objective and subjective evaluations of distortion, showing that it can effectively reduce distortion. Furthermore, in data augmentation experiments for end-to-end speech emotion recognition, the proposed StarGAN model achieves an increase of 2% in Micro-F1 and 5% in Macro-F1 over the baseline StarGAN model, indicating that the proposed model is more valuable for data augmentation.
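To make the two-encoder generator design concrete, the following is a minimal sketch of how such a module could look, assuming mel-spectrogram input features and a one-hot target-emotion label in a StarGAN-style many-to-many setting. All module names, layer sizes, the `n_emotions` parameter, and the PyTorch framing are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch (not the authors' implementation) of an autoencoder
# generator with two encoders: one extracts emotion-independent content
# features, the other extracts emotional style, and a decoder recombines
# the content with a target-emotion label for conversion.
import torch
import torch.nn as nn

class TwoEncoderGenerator(nn.Module):
    def __init__(self, feat_dim: int = 80, hidden: int = 256, n_emotions: int = 4):
        super().__init__()
        # Content encoder: captures linguistic/speaker information
        # that must be preserved across the conversion.
        self.content_encoder = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        # Emotion encoder: captures the source emotional style; its output
        # would be used by disentanglement losses during training.
        self.emotion_encoder = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        # Decoder conditions on content features plus a one-hot
        # target-emotion label, as in StarGAN-style conversion.
        self.decoder = nn.Sequential(
            nn.Linear(hidden + n_emotions, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim))

    def forward(self, x: torch.Tensor, target_emotion: torch.Tensor):
        # x: (batch, frames, feat_dim) acoustic features, e.g. mel-spectrogram
        # target_emotion: (batch, n_emotions) one-hot target label
        content = self.content_encoder(x)   # emotion-independent features
        emotion = self.emotion_encoder(x)   # source emotional style
        label = target_emotion.unsqueeze(1).expand(-1, x.size(1), -1)
        converted = self.decoder(torch.cat([content, label], dim=-1))
        return converted, content, emotion
```

In such a setup, the `converted` output would be fed to a StarGAN discriminator and domain classifier during adversarial training, while the separate `content` and `emotion` embeddings allow reconstruction and disentanglement objectives in a two-stage process; the exact losses and stages are as described in the paper, and this sketch only illustrates the separation of the two feature streams.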