Although speech recognition has become a widespread technology, inferring emotion from speech signals remains a challenge. To address this problem, this paper proposes a quaternion convolutional neural network (QCNN) based speech emotion recognition (SER) model in which Mel-spectrogram features of speech signals are encoded in an RGB quaternion domain. We show that our QCNN-based SER model outperforms other real-valued methods on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS, 8 classes) dataset, achieving, to the best of our knowledge, state-of-the-art results. The QCNN also achieves results comparable to state-of-the-art methods on the Interactive Emotional Dyadic Motion Capture (IEMOCAP, 4 classes) and Berlin EMO-DB (7 classes) datasets. Specifically, the model achieves accuracies of 77.87\%, 70.46\%, and 88.78\% on the RAVDESS, IEMOCAP, and EMO-DB datasets, respectively. In addition, our results show that the quaternion unit structure better encodes internal dependencies among channels, which significantly reduces the model size compared to other methods.
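The central ingredient above is the encoding of Mel-spectrogram features in an RGB quaternion domain. As an illustration only, the sketch below shows one plausible way to build such an input: a log-Mel spectrogram is rendered as an RGB image through a colormap, and the R, G, B channels are placed in the three imaginary quaternion components with a zero real part. The function name, the choice of colormap, and the zero-real-part convention are assumptions for illustration, not the paper's exact pipeline.

```python
import numpy as np
import librosa
import matplotlib.cm as cm

def rgb_quaternion_mel(wav_path, sr=16000, n_mels=128):
    """Illustrative sketch: render a log-Mel spectrogram as an RGB image and
    pack its R, G, B channels into the i, j, k parts of a quaternion tensor
    (zero real part), giving an array of shape (4, n_mels, frames)."""
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)

    # Normalize to [0, 1] and map through a colormap to obtain RGB channels.
    norm = (log_mel - log_mel.min()) / (log_mel.max() - log_mel.min() + 1e-8)
    rgb = cm.viridis(norm)[..., :3]  # (n_mels, frames, 3); drop the alpha channel

    # Quaternion encoding: real part = 0, imaginary parts i, j, k = R, G, B.
    real = np.zeros_like(rgb[..., 0])
    return np.stack([real, rgb[..., 0], rgb[..., 1], rgb[..., 2]], axis=0)
```

The resulting 4-channel tensor can then be fed to a quaternion convolution layer, whose Hamilton-product weight sharing is what lets the model capture cross-channel dependencies with fewer parameters than a comparable real-valued CNN.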