We present a Multi-Window Data Augmentation (MWA-SER) approach for speech emotion recognition. MWA-SER is a unimodal approach that focuses on two key concepts: designing the speech augmentation method and building a deep learning model to recognize the underlying emotion of an audio signal. Our proposed multi-window augmentation approach generates additional data samples from the speech signal by employing multiple window sizes in the audio feature extraction process. We show that our augmentation method, combined with a deep learning model, improves speech emotion recognition performance. We evaluate our approach on three benchmark datasets: IEMOCAP, SAVEE, and RAVDESS, and show that the multi-window model improves SER performance and outperforms a single-window model. Finding the best window size is an essential step in audio feature extraction; we perform extensive experimental evaluations to identify the best window choice and to explore the windowing effect in SER analysis.
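To make the core idea concrete, the following is a minimal sketch of how multiple window sizes can turn one labelled utterance into several feature sequences. The window sizes, hop length, and log-energy feature used here are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def frame_signal(signal, win_size, hop):
    """Slice a 1-D signal into overlapping frames of length win_size."""
    n_frames = 1 + (len(signal) - win_size) // hop
    return np.stack([signal[i * hop : i * hop + win_size]
                     for i in range(n_frames)])

def multi_window_features(signal, win_sizes, hop=160):
    """Extract one feature sequence per window size (hypothetical
    illustration of multi-window augmentation)."""
    views = []
    for w in win_sizes:
        frames = frame_signal(signal, w, hop)
        windowed = frames * np.hanning(w)  # taper each frame
        # simple per-frame log energy as a stand-in feature
        log_energy = np.log(np.sum(windowed ** 2, axis=1) + 1e-10)
        views.append(log_energy)
    return views

# Each window size yields a distinct feature sequence from the same
# utterance, so a single labelled clip becomes several training samples.
signal = np.random.randn(16000)            # 1 s of audio at 16 kHz
views = multi_window_features(signal, win_sizes=[400, 800, 1600])
```

In this sketch, the three window sizes produce three feature sequences of slightly different lengths from the same clip; each would carry the clip's emotion label, multiplying the effective training data.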