This paper proposes an end-to-end neural embedding system for speech emotion recognition, based on triplet loss and residual learning. The system learns embeddings from the emotional information in speech utterances and uses them to recognize the emotions portrayed by speech samples of various lengths. It implements a residual neural network (ResNet) architecture and is trained with softmax pre-training and a triplet loss function. The weights between the fully connected and embedding layers of the trained network are used to calculate the embedding values. The embedding representations of the various emotions are mapped onto a hyperplane, and the angles among them are computed using cosine similarity. These angles are used to classify a new speech sample into its appropriate emotion class. The proposed system achieves 91.67% and 64.44% accuracy in recognizing emotions on the RAVDESS and IEMOCAP datasets, respectively.
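The triplet objective and the angle-based classification step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact form of the triplet loss, the margin, or how per-emotion reference embeddings are formed, so the cosine-based loss, the `margin=0.2` default, the function names, and the toy 2-D reference vectors below are all assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def angle_between(u, v):
    """Angle in radians, clamping the cosine to [-1, 1] against rounding error."""
    cos = max(-1.0, min(1.0, cosine_similarity(u, v)))
    return math.acos(cos)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Cosine-based triplet loss (assumed form; margin value is hypothetical).

    The loss is zero once the anchor is more similar to the positive
    (same emotion) than to the negative (different emotion) by at least `margin`.
    """
    sim_pos = cosine_similarity(anchor, positive)
    sim_neg = cosine_similarity(anchor, negative)
    return max(0.0, sim_neg - sim_pos + margin)


def classify(embedding, class_embeddings):
    """Return the label whose reference embedding makes the smallest angle."""
    return min(
        class_embeddings,
        key=lambda label: angle_between(embedding, class_embeddings[label]),
    )


# Toy reference embeddings (hypothetical; real ones would come from the trained network).
class_embeddings = {"happy": [1.0, 0.0], "sad": [0.0, 1.0]}
predicted = classify([0.9, 0.1], class_embeddings)
```

In this toy setup, the query vector `[0.9, 0.1]` lies closest in angle to the `"happy"` reference, so that label is returned. In practice the embeddings would be the high-dimensional outputs of the trained residual network rather than hand-written vectors.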