We propose a deep graph approach to address the task of speech emotion recognition. Graphs offer a compact, efficient, and scalable way to represent data. Following the theory of graph signal processing, we propose to model a speech signal as a cycle graph or a line graph. Such a graph structure enables us to construct a Graph Convolution Network (GCN)-based architecture that can perform an accurate graph convolution, in contrast to the approximate convolution used in standard GCNs. We evaluated the performance of our model for speech emotion recognition on the popular IEMOCAP and MSP-IMPROV databases. Our model outperforms standard GCN and other relevant deep graph architectures, indicating the effectiveness of our approach. When compared with existing speech emotion recognition methods, our model achieves performance comparable to the state of the art with significantly fewer learnable parameters (~30K), indicating its applicability in resource-constrained devices.
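As an illustration of why a cycle graph admits an exact graph convolution, the following minimal sketch relies on a standard graph signal processing fact: the Laplacian of a cycle graph is circulant, so the discrete Fourier transform diagonalizes it and a spectral filter can be applied without approximation. This is not the architecture described above. The function names, the feature layout (one row per node, e.g. one speech frame per node), and the example filter response `h` are all assumptions made for illustration.

```python
import numpy as np


def cycle_laplacian(n):
    """Combinatorial Laplacian L = D - A of an undirected cycle graph (n >= 3)."""
    idx = np.arange(n)
    A = np.zeros((n, n))
    A[idx, (idx + 1) % n] = 1.0  # edge from node i to node i+1 (wrapping around)
    A[(idx + 1) % n, idx] = 1.0  # symmetric counterpart
    return np.diag(A.sum(axis=1)) - A


def filter_eig(x, h):
    """Exact spectral filtering h(L) x via an explicit eigendecomposition of L."""
    L = cycle_laplacian(x.shape[0])
    eigvals, U = np.linalg.eigh(L)
    # Graph Fourier transform, spectral scaling, inverse transform.
    return U @ (h(eigvals)[:, None] * (U.T @ x))


def filter_fft(x, h):
    """Same filter computed with the FFT, using the closed-form cycle spectrum."""
    n = x.shape[0]
    k = np.arange(n)
    # Eigenvalues of the cycle-graph Laplacian: 2 - 2 cos(2 pi k / n).
    lam = 2.0 - 2.0 * np.cos(2.0 * np.pi * k / n)
    X = np.fft.fft(x, axis=0)
    return np.real(np.fft.ifft(h(lam)[:, None] * X, axis=0))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((8, 3))    # 8 nodes (frames), 3 features per node
    h = lambda lam: 1.0 / (1.0 + lam)  # hypothetical low-pass spectral response
    print(np.allclose(filter_eig(x, h), filter_fft(x, h)))
```

Both routes compute the same operator h(L), so their outputs agree up to floating-point error. Standard GCNs instead replace the spectral filter with a low-order polynomial approximation in the Laplacian, which is the approximation that the abstract contrasts against.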