Emotional expressions form a key part of user behavior on today's digital platforms. While multimodal emotion recognition techniques are gaining research attention, there is a lack of deeper understanding of how visual and non-visual features can be used to better recognize emotions in certain contexts but not in others. This study analyzes the interplay between the effects of multimodal emotion features derived from facial expressions, tone, and text and two key contextual factors: i) the gender of the speaker, and ii) the duration of the emotional episode. Using a large public dataset of 2,176 manually annotated YouTube videos, we found that while multimodal features consistently outperformed bimodal and unimodal features, their performance varied significantly across different emotions, genders, and duration contexts. Multimodal features performed notably better for male speakers in recognizing most emotions. Furthermore, multimodal features performed notably better for shorter than for longer videos in recognizing neutrality and happiness, but not sadness and anger. These findings offer new insights toward the development of more context-aware emotion recognition and empathetic systems.