Multimodal sentiment analysis is an increasingly popular research area that extends the conventional language-based definition of sentiment analysis to a multimodal setup where other relevant modalities accompany language. In this paper, we pose the problem of multimodal sentiment analysis as modeling intra-modality and inter-modality dynamics. We introduce a novel model, termed Tensor Fusion Network, which learns both dynamics end-to-end. The proposed approach is tailored to the volatile nature of spoken language in online videos as well as the accompanying gestures and voice. In our experiments, the model outperforms state-of-the-art approaches for both multimodal and unimodal sentiment analysis.
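One way to model inter-modality dynamics end-to-end, as Tensor Fusion does, is to fuse the unimodal embeddings with a multi-fold outer product: appending a constant 1 to each embedding keeps the unimodal and bimodal interactions as sub-tensors of the trimodal tensor. Below is a minimal NumPy sketch of this fusion step; all embedding dimensions and variable names are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical unimodal embeddings (dimensions are illustrative only)
z_language = np.array([0.2, 0.7, 0.1])  # language embedding
z_visual   = np.array([0.5, 0.3])       # gesture/visual embedding
z_acoustic = np.array([0.9, 0.4])       # voice/acoustic embedding

def tensor_fuse(z_l, z_v, z_a):
    """Fuse three unimodal embeddings via a 3-fold outer product.

    Appending 1 to each vector ensures the resulting tensor contains
    the unimodal and bimodal interaction terms as sub-tensors, so one
    tensor captures both intra- and inter-modality dynamics.
    """
    z_l = np.append(z_l, 1.0)
    z_v = np.append(z_v, 1.0)
    z_a = np.append(z_a, 1.0)
    # einsum computes the outer product over all three axes
    return np.einsum('i,j,k->ijk', z_l, z_v, z_a)

fused = tensor_fuse(z_language, z_visual, z_acoustic)
print(fused.shape)  # (4, 3, 3): (|z_l|+1) x (|z_v|+1) x (|z_a|+1)
```

In a full model, this fused tensor would be flattened and passed through a sentiment-inference network; this sketch covers only the fusion layer.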