Sentiment analysis is a foundation of intelligent human-computer interaction. As one of the frontier research directions of artificial intelligence, it helps computers better identify human intentions and emotional states and thus provide more personalized services. However, humans express sentiment through spoken words, gestures, facial expressions, and other cues, which involve heterogeneous forms of data including text, audio, and video; this poses many challenges to the task. Owing to the limitations of unimodal sentiment analysis, recent research has focused on sentiment analysis of videos containing time-series data from multiple modalities. When analyzing videos with multimodal data, the key problem is how to fuse these heterogeneous data. Considering that each modality contributes differently, current fusion methods tend to extract the important information of each single modality before fusion, which ignores the consistency and complementarity of bimodal interactions and affects the final decision. To solve this problem, a video sentiment analysis method using multi-head attention with bimodal information augmentation is proposed. Based on bimodal interactions, more important bimodal features are assigned larger weights. In this way, different feature representations are adaptively assigned corresponding attention for effective multimodal fusion. Extensive experiments were conducted on both Chinese and English public datasets. The results show that our approach outperforms existing methods and gives insight into the contributions of bimodal interactions among the three modalities.
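To make the fusion idea concrete, the following is a minimal sketch of attention over bimodal pairs: unimodal text, audio, and video features are combined into three bimodal representations, and multi-head attention re-weights these pairs so that more informative interactions contribute more to the fused result. The feature dimensions, projection layers, pooling, and class count here are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class BimodalAttentionFusion(nn.Module):
    """Sketch of attention-based fusion over bimodal pairs (t-a, t-v, a-v).
    Layer sizes and design details are assumptions for illustration only."""
    def __init__(self, dim_t=128, dim_a=64, dim_v=64,
                 d_model=128, num_heads=4, num_classes=3):
        super().__init__()
        # Project each concatenated bimodal pair into a shared space.
        self.proj_ta = nn.Linear(dim_t + dim_a, d_model)
        self.proj_tv = nn.Linear(dim_t + dim_v, d_model)
        self.proj_av = nn.Linear(dim_a + dim_v, d_model)
        # Multi-head self-attention over the three bimodal tokens lets the
        # model assign larger weights to the more informative pairs.
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, t, a, v):
        # t: (B, dim_t), a: (B, dim_a), v: (B, dim_v) utterance-level features
        pairs = torch.stack([
            self.proj_ta(torch.cat([t, a], dim=-1)),
            self.proj_tv(torch.cat([t, v], dim=-1)),
            self.proj_av(torch.cat([a, v], dim=-1)),
        ], dim=1)                                        # (B, 3, d_model)
        fused, weights = self.attn(pairs, pairs, pairs)  # attend across bimodal tokens
        pooled = fused.mean(dim=1)                       # aggregate re-weighted pairs
        return self.classifier(pooled), weights

# Example usage with random features
model = BimodalAttentionFusion()
t, a, v = torch.randn(8, 128), torch.randn(8, 64), torch.randn(8, 64)
logits, attn_weights = model(t, a, v)
```

The returned attention weights indicate how much each bimodal pair contributes to the fused representation, which mirrors the kind of analysis of bimodal contributions described in the abstract.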