Video affective understanding, which aims to predict the expressions evoked by video content, is desirable for video creation and recommendation. The recent EEV challenge proposed a dense affective understanding task that requires frame-level affective prediction. In this paper, we propose a multi-granularity network with modal attention (MGN-MA), which employs multi-granularity features for a better description of the target frame. Specifically, the multi-granularity features are divided into frame-level, clip-level, and video-level features, which correspond to visually salient content, semantic context, and video theme information, respectively. A modal attention fusion module is then designed to fuse the multi-granularity features and emphasize the more affection-relevant modalities. Finally, the fused feature is fed into a Mixture-of-Experts (MoE) classifier to predict the expressions. With additional model-ensemble post-processing, the proposed method achieves a correlation score of 0.02292 in the EEV challenge.
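The fusion-and-classification pipeline described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: it assumes attention weights are computed by a linear scoring vector over each modality's feature, and that each expert is a linear map with a sigmoid output gated by a softmax. All parameter names (`w`, `gate_w`, `expert_ws`) and the linear/sigmoid forms are hypothetical simplifications.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D array
    e = np.exp(x - np.max(x))
    return e / e.sum()

def modal_attention_fusion(feats, w):
    """Weight each modality by an attention score and sum.

    feats: dict mapping modality name -> (d,) feature vector
           (e.g. frame-level, clip-level, video-level features)
    w:     (d,) scoring vector (hypothetical parameterization)
    """
    names = sorted(feats)
    scores = np.array([feats[n] @ w for n in names])
    attn = softmax(scores)  # emphasizes affection-relevant modalities
    fused = sum(a * feats[n] for a, n in zip(attn, names))
    return fused, dict(zip(names, attn))

def moe_predict(x, gate_w, expert_ws):
    """Mixture-of-Experts: softmax gate over sigmoid experts."""
    gate = softmax(gate_w @ x)
    outs = np.array([1.0 / (1.0 + np.exp(-(ew @ x))) for ew in expert_ws])
    return float(gate @ outs)

# Toy usage with random features standing in for real extracted ones
rng = np.random.default_rng(0)
d = 8
feats = {
    "frame": rng.normal(size=d),   # visually salient content
    "clip":  rng.normal(size=d),   # semantic context
    "video": rng.normal(size=d),   # video theme
}
w = rng.normal(size=d)
fused, attn = modal_attention_fusion(feats, w)
pred = moe_predict(fused, rng.normal(size=(3, d)), rng.normal(size=(3, d)))
```

The attention weights form a convex combination over the three granularities, and the MoE output stays in (0, 1), matching a per-frame expression score.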