In recent years, deep neural networks have demonstrated increasingly strong abilities to recognize objects and activities in videos. However, as video understanding becomes widely used in real-world applications, a key consideration is developing human-centric systems that understand not only the content of the video but also how it would affect the wellbeing and emotional state of viewers. To facilitate research in this setting, we introduce two large-scale datasets with over 60,000 videos manually annotated for emotional response and subjective wellbeing. The Video Cognitive Empathy (VCE) dataset contains annotations for distributions of fine-grained emotional responses, allowing models to gain a detailed understanding of affective states. The Video to Valence (V2V) dataset contains annotations of relative pleasantness between videos, which enables predicting a continuous spectrum of wellbeing. In experiments, we show how video models that are primarily trained to recognize actions and find contours of objects can be repurposed to understand human preferences and the emotional content of videos. Although there is room for improvement, predicting wellbeing and emotional response is on the horizon for state-of-the-art models. We hope our datasets can help foster further advances at the intersection of commonsense video understanding and human preference learning.
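To make the two annotation formats concrete, the following is a minimal illustrative sketch (not the authors' code) of the supervision signals the abstract describes: a VCE-style distribution over fine-grained emotional responses compared against a model's prediction, and a V2V-style pairwise pleasantness judgment scored with a ranking loss over scalar valence predictions. The emotion names, scores, and loss choices are hypothetical assumptions for illustration only.

```python
# Illustrative sketch of the two supervision signals described above.
# All labels, scores, and loss functions are assumptions, not the paper's method.

import math
from typing import Dict

# --- VCE-style supervision: a distribution over fine-grained emotions ---

def cross_entropy(target: Dict[str, float], predicted: Dict[str, float]) -> float:
    """Cross-entropy between an annotated emotion distribution and a
    model's predicted distribution (both assumed to sum to 1)."""
    eps = 1e-12
    return -sum(p * math.log(predicted.get(emotion, 0.0) + eps)
                for emotion, p in target.items() if p > 0.0)

# --- V2V-style supervision: relative pleasantness between two videos ---

def pairwise_ranking_loss(valence_preferred: float, valence_other: float) -> float:
    """Logistic (Bradley-Terry style) loss encouraging the video annotated as
    more pleasant to receive the higher predicted scalar valence score."""
    return math.log(1.0 + math.exp(-(valence_preferred - valence_other)))


if __name__ == "__main__":
    # Hypothetical annotation: viewers' emotional responses to one video.
    annotated = {"joy": 0.6, "surprise": 0.3, "fear": 0.1}
    predicted = {"joy": 0.5, "surprise": 0.3, "fear": 0.2}
    print("VCE-style distribution loss:", round(cross_entropy(annotated, predicted), 4))

    # Hypothetical pairwise label: video A was judged more pleasant than video B,
    # so the model's valence score for A should exceed its score for B.
    valence_a, valence_b = 1.3, -0.4
    print("V2V-style ranking loss:", round(pairwise_ranking_loss(valence_a, valence_b), 4))
```

Under these assumptions, the distribution loss captures how closely a model matches the full spread of annotated emotional responses, while the pairwise loss only requires ordering videos by predicted pleasantness, which is what enables recovering a continuous spectrum of wellbeing from relative judgments.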