Affective computing models are essential for human behavior analysis. A promising trend in affective systems is to enhance recognition performance by analyzing contextual information over time and across modalities. To overcome the limitations of instantaneous emotion recognition, the 2018 IJCNN challenge on One-Minute Gradual-Emotion Recognition (OMG-Emotion) encourages participants to address long-term emotion recognition using multimodal data such as facial expressions, audio, and language context. Compared with the single-modality models provided as baselines, a multimodal inference network can leverage the information from each modality and their correlations to improve recognition performance. In this paper, we propose a multimodal architecture that uses facial, audio, and language context features to recognize human sentiment from utterances. Our model outperforms the provided unimodal baselines, achieving concordance correlation coefficients (CCC) of 0.400 on the arousal task and 0.353 on the valence task.
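For reference, the concordance correlation coefficient used to score the arousal and valence predictions can be computed as in the sketch below. This is a minimal NumPy implementation of the standard CCC definition, not the challenge's official evaluation script, and the function name `ccc` is ours.

```python
import numpy as np

def ccc(y_true, y_pred):
    """Concordance correlation coefficient between two 1-D sequences of ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mean_true, mean_pred = y_true.mean(), y_pred.mean()
    var_true, var_pred = y_true.var(), y_pred.var()
    # Covariance between ground-truth and predicted annotations.
    cov = ((y_true - mean_true) * (y_pred - mean_pred)).mean()
    # CCC penalizes both low correlation and mean/scale mismatch.
    return 2 * cov / (var_true + var_pred + (mean_true - mean_pred) ** 2)
```

Unlike Pearson correlation, CCC also penalizes systematic shifts in mean and scale between predictions and annotations, which is why it is the preferred metric for continuous arousal and valence estimation.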