The modeling of human emotion expression in speech signals is an important yet challenging task. The high resource demand of speech emotion recognition models and the general scarcity of emotion-labelled data are obstacles to the development and application of effective solutions in this field. In this paper, we present an approach that addresses both difficulties jointly. Our method, named RH-emo, is a novel semi-supervised architecture that extracts quaternion embeddings from real-valued monaural spectrograms, enabling the use of quaternion-valued networks for speech emotion recognition tasks. RH-emo is a hybrid real/quaternion autoencoder network that consists of a real-valued encoder in parallel with a real-valued emotion classifier and a quaternion-valued decoder. On the one hand, the classifier makes it possible to optimize each latent axis of the embeddings for the classification of a specific emotion-related characteristic: valence, arousal, dominance and overall emotion. On the other hand, the quaternion reconstruction enables the latent dimension to develop the intra-channel correlations that are required for an effective representation as a quaternion entity. We test our approach on speech emotion recognition tasks using four popular datasets (Iemocap, Ravdess, EmoDb and Tess), comparing the performance of three well-established real-valued CNN architectures (AlexNet, ResNet-50, VGG) with that of their quaternion-valued equivalents fed with the embeddings created with RH-emo. We obtain a consistent improvement in test accuracy for all datasets, while drastically reducing the resource demand of the models. Moreover, we perform additional experiments and ablation studies that confirm the effectiveness of our approach. The RH-emo repository is available at: https://github.com/ispamm/rhemo.
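To illustrate why a four-axis embedding lets quaternion-valued layers reduce resource demand, the following is a minimal NumPy sketch of a generic quaternion linear layer built on the Hamilton product. It is not taken from the RH-emo repository: the function names `hamilton_matrix`, `quaternion_linear` and `parameter_counts` are hypothetical, and the layer sizes are arbitrary. The four input blocks stand in for the four latent axes described above, treated as the real, i, j and k components of one quaternion-valued feature vector.

```python
import numpy as np


def hamilton_matrix(r, i, j, k):
    """Real block matrix equivalent to left-multiplying by the
    quaternion weight W = r + i*i + j*j + k*k (Hamilton product).

    Each of r, i, j, k has shape (n_out, n_in); the result has shape
    (4 * n_out, 4 * n_in) but is built from only four distinct blocks.
    """
    return np.block([
        [r, -i, -j, -k],
        [i,  r, -k,  j],
        [j,  k,  r, -i],
        [k, -j,  i,  r],
    ])


def quaternion_linear(x, r, i, j, k):
    """Apply a quaternion linear map to x = [x_r, x_i, x_j, x_k].

    x is a real vector of length 4 * n_in holding the four quaternion
    components one after another (hypothetical layout for this sketch).
    """
    return hamilton_matrix(r, i, j, k) @ x


def parameter_counts(n_in, n_out):
    """Return (dense, quaternion) weight counts for the same I/O size."""
    dense = (4 * n_out) * (4 * n_in)   # unconstrained real matrix
    quaternion = 4 * n_out * n_in      # four shared blocks
    return dense, quaternion


# Example with arbitrary sizes: 8 quaternion inputs, 8 quaternion outputs.
rng = np.random.default_rng(0)
blocks = [rng.standard_normal((8, 8)) for _ in range(4)]
x = rng.standard_normal(4 * 8)
y = quaternion_linear(x, *blocks)
dense_params, quat_params = parameter_counts(8, 8)
```

In this sketch the quaternion layer stores a quarter of the weights of an unconstrained real layer with the same input and output size, because the four blocks are reused across the four components. That sharing is also what ties the components together, so the layer only helps when the four input axes are meaningfully correlated.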