Multimodal Sentiment Analysis with Word-Level Fusion and Reinforcement Learning

With the growing popularity of video-sharing websites such as YouTube and Facebook, multimodal sentiment analysis has received increasing attention from the scientific community. Previous work in multimodal sentiment analysis focuses on holistic information in speech segments, such as bag-of-words representations and average facial expression intensity. In contrast, we develop a novel deep architecture that performs modality fusion at the word level. In this paper, we propose the Gated Multimodal Embedding LSTM with Temporal Attention (GME-LSTM(A)) model, which is composed of two modules. The Gated Multimodal Embedding alleviates the difficulties of fusion when some modalities are noisy. The LSTM with Temporal Attention performs word-level fusion, a finer fusion resolution between input modalities, and attends to the most important time steps. As a result, the GME-LSTM(A) is able to better model the multimodal structure of speech through time and achieve better sentiment comprehension. We demonstrate the effectiveness of this approach on the publicly available Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis (CMU-MOSI) dataset by achieving state-of-the-art sentiment classification and regression results. Qualitative analysis of our model emphasizes the importance of the Temporal Attention Layer in sentiment prediction because the additional acoustic and visual modalities are noisy. We also demonstrate the effectiveness of the Gated Multimodal Embedding in selectively filtering out these noisy modalities. Our results and analysis open new areas in the study of sentiment analysis in human communication and provide new models for multimodal fusion.
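To make the two modules more concrete, the following is a minimal sketch of gated word-level fusion followed by temporal attention. It is an illustration under stated assumptions, not the authors' implementation. The function names (`gated_fusion`, `temporal_attention`), the weight vectors, and the toy feature values are all hypothetical. A soft sigmoid gate is swapped in for whatever gating mechanism the paper uses, and the reinforcement learning mentioned in the title is not modeled. To keep the sketch short, attention is applied directly to the fused word vectors, where a full model would presumably attend over LSTM hidden states.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gated_fusion(text, audio, video, w_audio, w_video):
    """Fuse word-aligned features; each input is a list of T feature vectors.

    Assumption: one scalar gate per word for each non-text modality,
    computed from that modality's own features.
    """
    fused, gates = [], []
    for t_vec, a_vec, v_vec in zip(text, audio, video):
        g_a = sigmoid(dot(w_audio, a_vec))  # near 0 suppresses noisy audio
        g_v = sigmoid(dot(w_video, v_vec))  # near 0 suppresses noisy video
        # Word-level fusion: concatenate text with the gated modalities.
        fused.append(list(t_vec) + [g_a * a for a in a_vec] + [g_v * v for v in v_vec])
        gates.append((g_a, g_v))
    return fused, gates


def temporal_attention(states, query):
    """Softmax-weighted sum over time steps; returns (summary, weights)."""
    scores = [dot(query, s) for s in states]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(states[0])
    summary = [sum(w * s[i] for w, s in zip(weights, states)) for i in range(dim)]
    return summary, weights


# Toy example: three words, 2-d text features, 1-d audio and video features.
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[0.5], [0.2], [0.9]]
video = [[0.1], [0.4], [0.3]]
fused, gates = gated_fusion(text, audio, video, w_audio=[-2.0], w_video=[1.0])
summary, weights = temporal_attention(fused, query=[0.5, 0.5, 0.0, 0.0])
```

In this toy setup the negative audio weight pushes the audio gates below 0.5, so audio contributes less to each fused word vector, while the attention weights form a distribution over the three words that favors the word scoring highest against the query.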
