Effective and flexible allocation of visual attention is key for pedestrians who have to navigate to a desired goal under different conditions of urgency and safety preferences. While automatic modeling of pedestrian attention holds great promise for improving simulations of pedestrian behavior, current saliency prediction approaches mostly focus on generic free-viewing scenarios and do not reflect the specific challenges of pedestrian attention prediction. In this paper, we present Context-SalNET, a novel encoder-decoder architecture that explicitly addresses three key challenges of predicting pedestrians' visual attention. First, Context-SalNET explicitly models the context factors urgency and safety preference in the latent space of the encoder-decoder model. Second, we propose the exponentially weighted mean squared error loss (ew-MSE), which better copes with the fact that only a small fraction of the entries in ground-truth saliency maps are non-zero. Third, we explicitly model epistemic uncertainty to account for the limited training data available for pedestrian attention prediction. To evaluate Context-SalNET, we recorded the first dataset of pedestrian visual attention in VR that explicitly varies the context factors urgency and safety preference. Context-SalNET achieves clear improvements over state-of-the-art saliency prediction approaches as well as over ablations. Our novel dataset will be made fully available and can serve as a valuable resource for further research on pedestrian attention prediction.
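The abstract does not state the ew-MSE formula. The sketch below is only an illustration of how such a loss could emphasize the sparse non-zero regions of a saliency map. It assumes that ground-truth maps are normalized to [0, 1] and that each pixel's squared error is weighted by exp(alpha * y), where y is the ground-truth value. The function name `ew_mse` and the parameter `alpha` are hypothetical, and this weighting is not necessarily the paper's definition.

```python
import numpy as np


def ew_mse(pred: np.ndarray, target: np.ndarray, alpha: float = 1.0) -> float:
    """Hypothetical exponentially weighted MSE for saliency maps.

    Assumption (not from the paper): each pixel's squared error is weighted by
    exp(alpha * target). Errors at salient (high-valued) ground-truth pixels
    therefore count more than errors in the large zero-valued background,
    where the weight is exp(0) = 1.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    if pred.shape != target.shape:
        raise ValueError("pred and target must have the same shape")
    # Per-pixel weights grow exponentially with ground-truth saliency.
    weights = np.exp(alpha * target)
    return float(np.mean(weights * (pred - target) ** 2))
```

With `alpha = 0` every weight equals 1, so the function reduces to the ordinary mean squared error. Larger `alpha` values penalize a model more for missing the few salient pixels than for small errors in the mostly-zero background. Under plain MSE, a trivial all-zero prediction would otherwise look deceptively good.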