Although media content is increasingly produced, distributed, and consumed in multiple combinations of modalities, how each individual modality contributes to the perceived emotion of a media item remains poorly understood. In this paper, we present MusicVideos (MuVi), a novel dataset for affective multimedia content analysis, designed to study how the auditory and visual modalities contribute to the perceived emotion of media. The data were collected by presenting music videos to participants under three conditions: music, visual, and audiovisual. Participants annotated the music videos for valence and arousal over time, as well as for the overall conveyed emotion. We present detailed descriptive statistics for key measures in the dataset and the results of feature importance analyses for each condition. Finally, we propose a novel transfer learning architecture for training Predictive models Augmented with Isolated modality Ratings (PAIR) and demonstrate the potential of isolated modality ratings for enhancing multimodal emotion recognition. Our results suggest that perceptions of arousal are driven primarily by auditory information, whereas perceptions of valence are more subjective and can be influenced by both visual and auditory information. The dataset is publicly available.
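The abstract only names the PAIR idea without detailing its architecture. As a rough sketch of how isolated-modality ratings might augment a multimodal emotion predictor via transfer learning, the snippet below freezes single-modality rater models (pretrained on the music-only and visual-only condition annotations) and fuses their predictions with raw features; all module names, dimensions, and the fusion scheme are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class IsolatedRater(nn.Module):
    """Predicts (valence, arousal) from a single modality; assumed to be
    pretrained on ratings from an isolated-modality condition."""

    def __init__(self, in_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),  # (valence, arousal)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class PairStyleModel(nn.Module):
    """Hypothetical PAIR-style model: a multimodal head augmented with the
    frozen isolated-modality raters' predictions."""

    def __init__(self, audio_dim: int, visual_dim: int,
                 audio_rater: IsolatedRater, visual_rater: IsolatedRater):
        super().__init__()
        self.audio_rater = audio_rater
        self.visual_rater = visual_rater
        # Transfer learning step: keep the pretrained raters fixed.
        for rater in (self.audio_rater, self.visual_rater):
            for p in rater.parameters():
                p.requires_grad = False
        # Fuse raw features with the raters' 2-dim outputs (2 raters x 2 dims = 4).
        self.head = nn.Sequential(
            nn.Linear(audio_dim + visual_dim + 4, 64), nn.ReLU(),
            nn.Linear(64, 2),  # final (valence, arousal) prediction
        )

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        iso = torch.cat([self.audio_rater(audio), self.visual_rater(visual)], dim=-1)
        return self.head(torch.cat([audio, visual, iso], dim=-1))
```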