This paper proposes a multimodal emotion recognition system based on hybrid fusion that classifies the emotions conveyed by speech utterances and their corresponding images into discrete classes. A new interpretability technique has been developed to identify the important speech and image features that lead to the prediction of particular emotion classes. The proposed system's architecture has been determined through extensive ablation studies. It first fuses the speech and image features and then combines the speech, image, and intermediate fusion outputs. The proposed interpretability technique uses a divide-and-conquer approach to compute Shapley values denoting each speech and image feature's importance. We have also constructed a large-scale dataset (the IIT-R SIER dataset) consisting of speech utterances, corresponding images, and class labels, i.e., 'anger,' 'happy,' 'hate,' and 'sad.' The proposed system achieves 83.29% accuracy for emotion recognition. This enhanced performance underscores the importance of utilizing complementary information from multiple modalities for emotion recognition.
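The abstract describes the fusion order only at a high level: intermediate fusion of the speech and image features, followed by a late combination of the speech, image, and fused outputs. The sketch below illustrates one way such a hybrid-fusion head could be wired up in PyTorch; the layer types, feature dimensions (speech_dim, image_dim, hidden_dim), and four-class output are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class HybridFusionClassifier(nn.Module):
    """Hypothetical hybrid-fusion head: intermediate fusion of speech and
    image features, followed by a late combination of the speech, image,
    and intermediate-fusion branches."""

    def __init__(self, speech_dim=128, image_dim=512, hidden_dim=256, num_classes=4):
        super().__init__()
        # Branch-specific projections (dimensions are illustrative).
        self.speech_proj = nn.Linear(speech_dim, hidden_dim)
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        # Intermediate fusion of the two projected feature vectors.
        self.intermediate_fusion = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        # Late combination of speech, image, and intermediate-fusion outputs.
        self.classifier = nn.Linear(3 * hidden_dim, num_classes)

    def forward(self, speech_feat, image_feat):
        s = torch.relu(self.speech_proj(speech_feat))
        v = torch.relu(self.image_proj(image_feat))
        f = self.intermediate_fusion(torch.cat([s, v], dim=-1))
        # Unimodal branches and the fused representation feed the classifier.
        return self.classifier(torch.cat([s, v, f], dim=-1))
```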
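The abstract states only that Shapley-value feature importances are computed with a divide-and-conquer strategy. The following sketch shows one possible scheme under that description: a group of features is scored by its occlusion effect against a baseline, and the group is split recursively only when its effect is non-negligible. The function names (dc_importance, group_effect), the masking-by-baseline choice, and the pruning threshold are assumptions for illustration, not the paper's exact algorithm.

```python
import numpy as np

def group_effect(predict, x, baseline, idx):
    """Change in the model output when the features in `idx` are replaced
    by the corresponding baseline values (simple occlusion)."""
    x_masked = x.copy()
    x_masked[idx] = baseline[idx]
    return float(predict(x) - predict(x_masked))

def dc_importance(predict, x, baseline, idx=None, threshold=1e-3):
    """Toy divide-and-conquer attribution: score a feature group by its
    occlusion effect and recurse into halves only when the group matters.
    A Shapley-style sketch, not the paper's exact algorithm."""
    if idx is None:
        idx = np.arange(x.shape[-1])
    effect = group_effect(predict, x, baseline, idx)
    if len(idx) == 1:
        return {int(idx[0]): effect}
    if abs(effect) < threshold:
        # Negligible group: spread its small effect evenly, skip recursion.
        return {int(i): effect / len(idx) for i in idx}
    mid = len(idx) // 2
    scores = dc_importance(predict, x, baseline, idx[:mid], threshold)
    scores.update(dc_importance(predict, x, baseline, idx[mid:], threshold))
    return scores

if __name__ == "__main__":
    # Toy linear model: the recovered importances equal the weighted inputs.
    w = np.array([0.5, -1.0, 2.0, 0.0])
    predict = lambda z: float(z @ w)
    print(dc_importance(predict, np.ones(4), np.zeros(4)))
```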