Speech signals are subject to more acoustic interference and emotional variability than other signals, and noisy, emotion-laden speech data poses a challenge for real-time speech processing applications. It is therefore essential to find an effective way to separate the dominant signal from external influences: an ideal system should be able to accurately recognize the target auditory events in a complex acoustic scene captured under unfavorable conditions. This paper proposes a novel approach to speaker identification under unfavorable conditions such as emotion and interference, using a pre-trained Deep Neural Network mask and speech VGG. The proposed model outperformed recent approaches in the literature on English and Arabic emotional speech data, reporting average speaker identification rates of 85.2\%, 87.0\%, and 86.6\% on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), the Speech Under Simulated and Actual Stress (SUSAS) dataset, and the Emirati-accented Speech dataset (ESD), respectively.
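To make the pipeline concrete, below is a minimal PyTorch sketch of the two-stage architecture the abstract describes: a pre-trained DNN estimates a time-frequency mask that suppresses interference in the noisy spectrogram, and a VGG-style network extracts features from the enhanced spectrogram to classify the speaker. All class names, layer configurations, and hyperparameters here are illustrative assumptions, not the values used in the paper.

\begin{verbatim}
# A minimal sketch (not the authors' released code) of the two-stage
# pipeline described above, assuming standard PyTorch. Stage 1: a DNN
# estimates a time-frequency mask to suppress interference. Stage 2: a
# VGG-style network classifies the speaker from the enhanced spectrogram.
import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    """Stand-in for the pre-trained DNN mask: maps a noisy magnitude
    spectrogram (batch, 1, freq, time) to a [0, 1] mask of the same shape."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, kernel_size=3, padding=1), nn.Sigmoid(),
        )

    def forward(self, noisy_spec):
        return self.net(noisy_spec)

def vgg_block(in_ch, out_ch):
    """Two 3x3 convolutions + ReLU followed by 2x2 max-pooling, as in VGG."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
    )

class SpeechVGGClassifier(nn.Module):
    """VGG-style feature extractor with a speaker-identification head."""
    def __init__(self, n_speakers):
        super().__init__()
        self.features = nn.Sequential(
            vgg_block(1, 64), vgg_block(64, 128), vgg_block(128, 256),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(256, n_speakers)

    def forward(self, spec):
        x = self.pool(self.features(spec)).flatten(1)
        return self.head(x)

# Enhancement followed by identification on a dummy batch.
mask_net = MaskEstimator()
speaker_net = SpeechVGGClassifier(n_speakers=24)  # RAVDESS has 24 actors
noisy = torch.rand(8, 1, 257, 100)                # (batch, chan, freq, frames)
enhanced = mask_net(noisy) * noisy                # element-wise masking
logits = speaker_net(enhanced)                    # (8, 24) speaker scores
\end{verbatim}

Applying the mask multiplicatively to the noisy spectrogram before classification, as sketched here, lets the identification network operate on an enhanced signal without retraining the mask estimator.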