Saliency prediction refers to the computational task of modeling overt attention. Social cues greatly influence our attention, consequently altering our eye movements and behavior. To emphasize the efficacy of such features, we present a neural model for integrating social cues and weighting their influences. Our model consists of two stages. During the first stage, we detect two social cues, gaze and affect, by following gaze, estimating gaze direction, and recognizing affect. These features are then transformed into spatiotemporal maps through image processing operations. The transformed representations are propagated to the second stage (GASP), where we explore various late fusion techniques for integrating the social cues and introduce two sub-networks for directing attention to relevant stimuli. Our experiments indicate that fusion approaches achieve better results for static integration methods, whereas non-fusion approaches, for which the influence of each modality is unknown, result in better outcomes when coupled with recurrent models for dynamic saliency prediction. We show that gaze direction and affective representations improve the correspondence between predicted and ground-truth saliency maps by at least 5% compared to dynamic saliency models without social cues. Furthermore, affective representations improve GASP, supporting the necessity of considering affect-biased attention in saliency prediction.
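To make the second-stage integration concrete, the sketch below illustrates one plausible form of gated late fusion over per-cue spatiotemporal maps. It is a minimal PyTorch illustration, not the authors' implementation: the module name `GatedLateFusion`, the tensor shapes, and the softmax gating are assumptions introduced here for exposition.

```python
# Hypothetical sketch of gated late fusion over social-cue maps
# (illustrative only; not the GASP authors' architecture).
import torch
import torch.nn as nn


class GatedLateFusion(nn.Module):
    """Weight each social-cue map with a learned spatial gate, then fuse."""

    def __init__(self, num_modalities: int, channels: int):
        super().__init__()
        # Per-pixel gating weights over modalities, normalized with softmax,
        # so each modality's influence on the fused map is explicit.
        self.gate = nn.Sequential(
            nn.Conv2d(num_modalities * channels, num_modalities, kernel_size=1),
            nn.Softmax(dim=1),
        )
        # A 1x1 convolution collapses the gated features into one saliency map.
        self.fuse = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, cue_maps: list[torch.Tensor]) -> torch.Tensor:
        # cue_maps: M tensors of shape (B, C, H, W), one per social cue.
        weights = self.gate(torch.cat(cue_maps, dim=1))  # (B, M, H, W)
        fused = sum(weights[:, i:i + 1] * m for i, m in enumerate(cue_maps))
        return torch.sigmoid(self.fuse(fused))           # (B, 1, H, W)


# Usage: fuse gaze-following, gaze-direction, and affect maps.
model = GatedLateFusion(num_modalities=3, channels=16)
maps = [torch.randn(2, 16, 32, 32) for _ in range(3)]
saliency = model(maps)  # (2, 1, 32, 32)
```

In this sketch the gate exposes each modality's contribution, which corresponds to the fusion approaches described above; the non-fusion approaches, by contrast, combine modalities without such explicit per-modality weighting.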