Robust speech emotion recognition relies on the quality of the speech features. We present a speech feature enhancement strategy that improves speech emotion recognition. We used the INTERSPEECH 2010 challenge feature set. We identified subsets of the feature set and applied Principal Component Analysis (PCA) to each subset. Finally, the features are fused horizontally. The resulting feature set is analyzed using t-distributed stochastic neighbour embedding (t-SNE) before the features are applied to emotion recognition. The method is compared with the state-of-the-art methods used in the literature. The empirical evidence is drawn from two well-known datasets: the Emotional Speech Dataset (EMO-DB) and the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), covering two languages, German and English, respectively. Compared to the baseline study, our method achieved an average recognition gain of 11.5\% for six out of seven emotions on the EMO-DB dataset, and 13.8\% for seven out of eight emotions on the RAVDESS dataset.
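The enhancement pipeline described above (partition the feature set into subsets, reduce each subset with PCA, then fuse the reduced subsets horizontally) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the subset partitioning, the number of retained components, and the toy data are all assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

def enhance_features(subsets, n_components=2):
    """Apply PCA to each feature subset, then fuse the reduced
    representations horizontally (column-wise concatenation)."""
    reduced = [PCA(n_components=n_components).fit_transform(X) for X in subsets]
    return np.hstack(reduced)

# Toy stand-in for a partitioned feature set: 3 hypothetical subsets
# of differing dimensionality over the same 10 utterances.
rng = np.random.default_rng(0)
subsets = [rng.normal(size=(10, d)) for d in (5, 8, 3)]

fused = enhance_features(subsets)
print(fused.shape)  # (10, 6): 2 PCA components kept per subset, fused
```

The fused matrix would then feed a visualization step (e.g. `sklearn.manifold.TSNE`) and the downstream emotion classifier; in practice the subsets would be groups of INTERSPEECH 2010 descriptors rather than random data.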