This work explores the effect of gender- and language-based vocal variations on the accuracy of emotive expression classification. Emotive expressions are considered from the perspective of spectral features in speech (Mel-Frequency Cepstral Coefficients, Mel-spectrogram, and Spectral Contrast), and emotions are framed in terms of Basic Emotion Theory. A convolutional neural network is utilised to classify emotive expressions in emotive audio datasets in English, German, and Italian. Vocal variations in spectral features are assessed through (i) a comparative analysis identifying suitable spectral features, (ii) classification performance on mono-, multi- and cross-lingual emotive data, and (iii) an empirical evaluation of a machine learning model to assess the effects of gender and linguistic variation on classification accuracy. The results show that spectral features offer a promising avenue for improving emotive expression classification. Additionally, classification accuracy was high for mono- and cross-lingual emotive data but poor for multi-lingual data, and there were differences in classification accuracy between gender populations. These results demonstrate the importance of accounting for population differences to enable accurate speech emotion recognition.
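To make the feature pipeline concrete, the following is a minimal sketch, assuming the three spectral feature sets named above are extracted with the librosa library; the function name, file path, and the choice of 40 MFCC coefficients and time-averaged pooling are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch (assumed, not the authors' pipeline): extract the three spectral
# feature sets named in the abstract and pool them into one vector per utterance.
import numpy as np
import librosa

def extract_spectral_features(path: str, n_mfcc: int = 40) -> np.ndarray:
    """Return a 1-D vector of MFCC, mel-spectrogram, and spectral-contrast means."""
    y, sr = librosa.load(path, sr=None)  # keep the file's native sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    mel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr))
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)
    # Average each feature over time so utterances of different lengths map to
    # fixed-size vectors; a CNN could instead consume the 2-D feature maps directly.
    return np.concatenate([mfcc.mean(axis=1), mel.mean(axis=1), contrast.mean(axis=1)])

if __name__ == "__main__":
    vec = extract_spectral_features("sample_utterance.wav")  # hypothetical file
    print(vec.shape)
```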