In this paper, we propose an ensemble of deep neural networks, trained with data augmentation (DA) on effective speech-based features, to recognize emotions from speech. The ensemble is built on three deep neural network models. Each model is built from basic local feature acquiring blocks (LFABs), which are consecutive dilated 1D convolutional layers followed by max pooling and batch normalization layers. To capture long-term dependencies in speech signals, we propose two further variants that add Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM) layers, respectively. All three models have consecutive fully connected layers before the final softmax layer for classification. The ensemble combines the models' outputs with a weighted average to produce the final classification. We evaluate on five standard benchmark datasets: TESS, EMO-DB, RAVDESS, SAVEE, and CREMA-D. We perform DA by injecting additive white Gaussian noise, pitch shifting, and stretching the signal, which helps the models generalize, increasing accuracy and reducing overfitting. From each audio sample we extract five categories of handcrafted features: Mel-frequency cepstral coefficients, log Mel-scaled spectrogram, zero-crossing rate, chromagram, and the statistical root mean square energy value. These features are the input to the LFABs, which extract hidden local features; depending on the model type, these are then fed either directly to the fully connected layers or first to LSTM or GRU layers, which acquire additional long-term contextual representations. LFABs followed by GRU or LSTM layers perform better than the baseline model. The ensemble model achieves state-of-the-art weighted average accuracy on all five datasets.
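The weighted-average step of the ensemble can be illustrated with a minimal sketch. It assumes each of the three models emits softmax class probabilities per sample. The function name `weighted_average_ensemble`, the example probabilities, and the weights `[0.2, 0.4, 0.4]` are hypothetical and chosen only for illustration; the abstract does not state the weights or how they are selected.

```python
from typing import List, Sequence, Tuple


def weighted_average_ensemble(
    model_probs: Sequence[Sequence[Sequence[float]]],
    weights: Sequence[float],
) -> Tuple[List[List[float]], List[int]]:
    """Combine per-model class probabilities with a weighted average.

    model_probs[m][i][c] is model m's probability of class c for sample i.
    weights[m] is the (non-negative) weight given to model m.
    Returns the combined probabilities and the predicted class per sample.
    """
    if not model_probs or len(model_probs) != len(weights):
        raise ValueError("need one weight per model")
    total = sum(weights)
    if total <= 0 or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative and not all zero")

    # Normalize so the combined scores remain a probability distribution.
    norm = [w / total for w in weights]
    n_samples = len(model_probs[0])
    n_classes = len(model_probs[0][0])

    combined = [
        [
            sum(w * probs[i][c] for w, probs in zip(norm, model_probs))
            for c in range(n_classes)
        ]
        for i in range(n_samples)
    ]
    # Predicted label is the class with the highest combined probability.
    labels = [max(range(n_classes), key=row.__getitem__) for row in combined]
    return combined, labels


# Hypothetical softmax outputs for 2 samples and 3 emotion classes.
cnn_probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]   # LFAB-only model
gru_probs = [[0.2, 0.7, 0.1], [0.1, 0.3, 0.6]]   # LFAB + GRU variant
lstm_probs = [[0.3, 0.6, 0.1], [0.2, 0.2, 0.6]]  # LFAB + LSTM variant

combined, labels = weighted_average_ensemble(
    [cnn_probs, gru_probs, lstm_probs],
    weights=[0.2, 0.4, 0.4],  # assumed weights, not taken from the paper
)
print(labels)  # [1, 2]
```

In this example the LFAB-only model alone would predict classes 0 and 1, but the two recurrent variants carry more weight and the ensemble predicts classes 1 and 2. In practice such weights would typically be tuned on a validation set.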