Movie story analysis requires understanding characters' emotions and mental states. Towards this goal, we formulate emotion understanding as predicting a diverse, multi-label set of emotions at the level of a movie scene and for each character. We propose EmoTx, a multimodal Transformer-based architecture that ingests videos, multiple characters, and dialog utterances to make joint predictions. By leveraging annotations from the MovieGraphs dataset, we aim to predict classic emotions (e.g. happy, angry) and other mental states (e.g. honest, helpful). We conduct experiments on the 10 and 25 most frequently occurring labels, as well as a mapping that clusters 181 labels into 26. Ablation studies and comparisons against adapted state-of-the-art emotion recognition approaches show the effectiveness of EmoTx. Analyzing EmoTx's self-attention scores reveals that expressive emotions often attend to character tokens, while other mental states rely on video and dialog cues.
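To make the joint, multi-label formulation concrete, the sketch below shows one plausible way to wire it up; it is an illustrative assumption, not the authors' implementation, and all module names, feature dimensions, and the number of Transformer layers are placeholders. The key ideas it reflects are (i) projecting pre-extracted video, character, and dialog features into a shared token space, (ii) adding one classification token for the scene and one per character, and (iii) scoring each label independently with a sigmoid so multiple emotions can co-occur.

```python
# Minimal sketch (assumed, not the EmoTx code): Transformer over multimodal
# tokens with per-target CLS tokens and independent sigmoid label scores.
import torch
import torch.nn as nn

class MultimodalEmotionTagger(nn.Module):
    def __init__(self, d_model=512, num_labels=25, max_characters=4,
                 video_dim=2048, face_dim=512, text_dim=768):
        super().__init__()
        # Project each modality's pre-extracted features into a shared space.
        self.video_proj = nn.Linear(video_dim, d_model)
        self.face_proj = nn.Linear(face_dim, d_model)
        self.text_proj = nn.Linear(text_dim, d_model)
        # Learnable classification tokens: one for the scene, one per character slot.
        self.cls_tokens = nn.Parameter(torch.randn(1 + max_characters, d_model))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        # One head scoring num_labels labels; sigmoid (not softmax) because
        # several emotions/mental states can hold at the same time.
        self.classifier = nn.Linear(d_model, num_labels)

    def forward(self, video_feats, face_feats, text_feats):
        # video_feats: (B, Tv, video_dim)   frame-level features
        # face_feats:  (B, C, Tc, face_dim) per-character face-track features
        # text_feats:  (B, Tu, text_dim)    dialog utterance embeddings
        B, C = face_feats.shape[0], face_feats.shape[1]
        tokens = torch.cat([
            self.cls_tokens.unsqueeze(0).expand(B, -1, -1),
            self.video_proj(video_feats),
            self.face_proj(face_feats.flatten(1, 2)),
            self.text_proj(text_feats),
        ], dim=1)
        x = self.encoder(tokens)
        # The first 1 + C outputs correspond to the scene and character CLS tokens.
        logits = self.classifier(x[:, : 1 + C])
        return torch.sigmoid(logits)  # (B, 1 + C, num_labels)

# Example: one scene with 2 characters, scored over a top-25 label set.
model = MultimodalEmotionTagger(num_labels=25, max_characters=2)
scores = model(torch.randn(1, 16, 2048), torch.randn(1, 2, 8, 512),
               torch.randn(1, 5, 768))
print(scores.shape)  # torch.Size([1, 3, 25])
```

Keeping a single shared classifier over all CLS tokens is one design choice among several; the point is simply that scene-level and character-level predictions fall out of the same joint forward pass.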