An Ensemble 1D-CNN-LSTM-GRU Model with Data Augmentation for Speech Emotion Recognition

In this paper, we propose an ensemble of deep neural networks, trained with data augmentation (DA) on speech-based features, to recognize emotions from speech. The ensemble combines three deep neural network models. Each model is built from basic local feature acquiring blocks (LFABs): consecutive dilated 1D convolutional layers followed by max pooling and batch normalization layers. To capture the long-term dependencies in speech signals, two further variants add Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM) layers, respectively. All three models have consecutive fully connected layers before the final softmax layer for classification. The ensemble model uses a weighted average of the three models' outputs to produce the final classification. We evaluate on five standard benchmark datasets: TESS, EMO-DB, RAVDESS, SAVEE, and CREMA-D. We perform DA by injecting additive white Gaussian noise, pitch shifting, and stretching the signal, which helps the models generalize, increases their accuracy, and reduces overfitting. From each audio sample we handcraft five categories of features: Mel-frequency cepstral coefficients, log Mel-scaled spectrogram, zero-crossing rate, chromagram, and the statistical root mean square energy value. These features are the input to the LFABs, which extract hidden local features. Depending on the model type, those local features are then fed either directly to the fully connected layers or first to LSTM or GRU layers, which acquire additional long-term contextual representations. LFABs followed by GRU or LSTM layers perform better than the baseline model. The ensemble model achieves state-of-the-art weighted average accuracy on all five datasets.
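Two of the steps named in the abstract, noise injection and weighted-average ensembling, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the SNR-based noise level, the example probabilities, and the ensemble weights are all assumptions made for illustration; the abstract does not state how the noise level or the weights are chosen.

```python
import math
import random


def add_awgn(signal, snr_db, rng=None):
    """Return a copy of `signal` with additive white Gaussian noise.

    The noise variance is set from a target signal-to-noise ratio in dB.
    Using an SNR target is an assumption, not taken from the paper.
    """
    rng = rng or random.Random()
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def weighted_ensemble(prob_rows, weights):
    """Weighted average of per-model class-probability vectors."""
    total = sum(weights)
    n_classes = len(prob_rows[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, prob_rows)) / total
        for c in range(n_classes)
    ]


# Hypothetical softmax outputs of the three models for one utterance
# over three emotion classes (values invented for illustration).
cnn_probs = [0.6, 0.3, 0.1]
lstm_probs = [0.2, 0.7, 0.1]
gru_probs = [0.3, 0.6, 0.1]

# Hypothetical ensemble weights, e.g. tuned on a validation set.
weights = [0.2, 0.4, 0.4]

avg_probs = weighted_ensemble([cnn_probs, lstm_probs, gru_probs], weights)
predicted_class = max(range(len(avg_probs)), key=avg_probs.__getitem__)

# Augment a toy waveform at 20 dB SNR with a fixed seed for repeatability.
clean = [math.sin(2 * math.pi * 5 * t / 200) for t in range(200)]
noisy = add_awgn(clean, snr_db=20, rng=random.Random(0))
```

In this toy case the two recurrent variants outvote the convolution-only model, so the ensemble picks the second class even though one member prefers the first. Pitch shifting and stretching are omitted here because they normally rely on an audio library.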

