Speech emotion recognition (SER) faces many challenges, a principal one being the lack of a unified standard across frameworks. In this paper, we propose SpeechEQ, a framework for unifying SER tasks based on a multi-scale unified metric. This metric can be trained via Multitask Learning (MTL), which includes two emotion recognition tasks, Emotion States Category (ESC) and Emotion Intensity Scale (EIS), and two auxiliary tasks, phoneme recognition and gender recognition. For this framework, we build a Mandarin SER dataset, the SpeechEQ Dataset (SEQD). We conducted experiments on the public CASIA and ESD datasets in Mandarin, which show that our method outperforms baseline methods by a relatively large margin, yielding accuracy improvements of 8.0% and 6.5%, respectively. Additional experiments on IEMOCAP with four emotion categories (i.e., angry, happy, sad, and neutral) also show that the proposed method achieves state-of-the-art results, with a weighted accuracy (WA) of 78.16% and an unweighted accuracy (UA) of 77.47%.
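To make the multitask setup concrete, the following is a minimal sketch of an MTL model with a shared speech encoder feeding four task-specific heads (ESC, EIS, phoneme, and gender). All module names, class counts, and loss weights here are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class SpeechEQMTL(nn.Module):
    """Sketch of a multitask SER model: a shared speech encoder feeds
    four task-specific heads (ESC, EIS, phoneme, gender). Hypothetical,
    not the paper's architecture."""

    def __init__(self, feat_dim=80, hidden=256,
                 n_emotions=4, n_intensities=3, n_phonemes=60):
        super().__init__()
        # Shared encoder (placeholder: a small BiLSTM over acoustic frames).
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        enc_out = hidden * 2
        # Task heads; the class counts are illustrative.
        self.esc_head = nn.Linear(enc_out, n_emotions)      # Emotion States Category
        self.eis_head = nn.Linear(enc_out, n_intensities)   # Emotion Intensity Scale
        self.phoneme_head = nn.Linear(enc_out, n_phonemes)  # frame-level auxiliary task
        self.gender_head = nn.Linear(enc_out, 2)            # utterance-level auxiliary task

    def forward(self, feats):
        # feats: (batch, frames, feat_dim) acoustic features, e.g. log-mel.
        frames, _ = self.encoder(feats)   # (batch, frames, enc_out)
        pooled = frames.mean(dim=1)       # mean-pool to utterance level
        return {
            "esc": self.esc_head(pooled),
            "eis": self.eis_head(pooled),
            "phoneme": self.phoneme_head(frames),  # per-frame logits
            "gender": self.gender_head(pooled),
        }

def mtl_loss(outputs, targets, weights=(1.0, 1.0, 0.3, 0.3)):
    """Weighted sum of per-task cross-entropy losses (weights are assumed)."""
    ce = nn.functional.cross_entropy
    l_esc = ce(outputs["esc"], targets["esc"])
    l_eis = ce(outputs["eis"], targets["eis"])
    # Flatten the frame dimension for the frame-level phoneme task.
    l_pho = ce(outputs["phoneme"].flatten(0, 1), targets["phoneme"].flatten())
    l_gen = ce(outputs["gender"], targets["gender"])
    w = weights
    return w[0] * l_esc + w[1] * l_eis + w[2] * l_pho + w[3] * l_gen
```

In this sketch the auxiliary tasks are down-weighted so that the emotion objectives dominate training; the specific weighting scheme the paper uses is not specified in the abstract.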