We present a novel Multi-Window Data Augmentation (MWA-SER) approach for speech emotion recognition. MWA-SER is a unimodal approach that focuses on two key concepts: designing a speech augmentation method to generate additional data samples and building deep learning models to recognize the underlying emotion of an audio signal. The multi-window augmentation method extracts more audio features from the speech signal by employing multiple window sizes during feature extraction. We show that the proposed augmentation method, combined with a deep learning model, improves speech emotion recognition performance. We evaluate MWA-SER on the IEMOCAP corpus and show that it achieves state-of-the-art results. Furthermore, the proposed system achieves 70% and 88% accuracy when recognizing emotions in the SAVEE and RAVDESS datasets, respectively.
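The sketch below illustrates the multi-window feature-extraction idea in a minimal form, assuming MFCC features computed with librosa; the window sizes, hop length, and number of coefficients are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of multi-window feature extraction: the same utterance is
# analyzed with several window sizes, yielding one feature set per window
# (and thus additional training samples for the same emotion label).
# Window sizes, hop length, and n_mfcc are assumed values for illustration.
import librosa


def multi_window_mfcc(path, window_sizes=(1024, 2048, 4096), n_mfcc=13, hop_length=512):
    """Return one MFCC matrix per window size for a single audio file."""
    y, sr = librosa.load(path, sr=None)  # keep the file's native sampling rate
    features = []
    for win in window_sizes:
        mfcc = librosa.feature.mfcc(
            y=y, sr=sr, n_mfcc=n_mfcc,
            n_fft=win, win_length=win, hop_length=hop_length,
        )
        features.append(mfcc)
    return features
```

Each entry of the returned list can then be paired with the utterance's emotion label, effectively multiplying the number of training samples by the number of window sizes.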