The process of human affect understanding involves inferring person-specific emotional states from various sources, including images, speech, and language. Affect perception from images has predominantly focused on expressions extracted from salient face crops. However, the emotions perceived by humans rely on multiple contextual cues, including social settings, foreground interactions, and the ambient visual scene. In this work, we leverage pretrained vision-language (VLN) models to extract descriptions of foreground context from images. Further, we propose a multimodal context fusion (MCF) module to combine these foreground cues with visual scene and person-based contextual information for emotion prediction. We show the effectiveness of our proposed modular design on two datasets associated with natural scenes and TV shows.
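To make the fusion idea concrete, below is a minimal sketch of what a multimodal context fusion module could look like, assuming the three context streams (foreground description, visual scene, and person-based features) are already encoded as fixed-size vectors. All dimensions, layer choices, the concatenation-plus-MLP fusion strategy, and the class count are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a multimodal context fusion (MCF) module: layer sizes,
# fusion strategy (projection + concatenation + MLP), and the number of emotion
# classes are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class MultimodalContextFusion(nn.Module):
    def __init__(self, dim_fg=768, dim_scene=2048, dim_person=512,
                 dim_hidden=256, num_emotions=26):
        super().__init__()
        # Project each context stream to a shared hidden dimension.
        self.proj_fg = nn.Linear(dim_fg, dim_hidden)        # foreground cues (e.g., VLM description embedding)
        self.proj_scene = nn.Linear(dim_scene, dim_hidden)   # ambient visual scene features
        self.proj_person = nn.Linear(dim_person, dim_hidden) # person-based features (e.g., body/face crop)
        # Fuse by concatenation followed by a small MLP classifier.
        self.fusion = nn.Sequential(
            nn.Linear(3 * dim_hidden, dim_hidden),
            nn.ReLU(),
            nn.Linear(dim_hidden, num_emotions),
        )

    def forward(self, fg_feat, scene_feat, person_feat):
        z = torch.cat([
            self.proj_fg(fg_feat),
            self.proj_scene(scene_feat),
            self.proj_person(person_feat),
        ], dim=-1)
        return self.fusion(z)  # emotion logits

# Example: a batch of 4 images with precomputed context features.
logits = MultimodalContextFusion()(torch.randn(4, 768),
                                   torch.randn(4, 2048),
                                   torch.randn(4, 512))
```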