The prediction of human gaze behavior is important for building human-computer interactive systems that can anticipate a user's attention. Computer vision models have been developed to predict the fixations made by people as they search for target objects. But what about when the image has no target? Equally important is knowing how people search when they cannot find a target, and when they stop searching. In this paper, we propose the first data-driven computational model that addresses the search-termination problem and predicts the scanpath of search fixations made by people searching for targets that do not appear in images. We model visual search as an imitation learning problem and represent the internal knowledge that the viewer acquires through fixations using a novel state representation that we call Foveated Feature Maps (FFMs). FFMs integrate a simulated foveated retina into a pretrained ConvNet that produces an in-network feature pyramid, all with minimal computational overhead. Our method uses FFMs as the state representation in inverse reinforcement learning. Experimentally, we improve the state of the art in predicting human target-absent search behavior on the COCO-Search18 dataset.
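To make the FFM idea concrete, the sketch below illustrates one way a simulated foveated retina could be combined with an in-network feature pyramid from a pretrained ConvNet: fine features dominate near the current fixation and coarse features dominate in the periphery. This is a minimal illustrative sketch, not the authors' implementation; the backbone choice (torchvision ResNet-50), the Gaussian acuity map, the channel matching, and all names (`ffm`, `fovea_sigma`) are assumptions.

```python
# Illustrative sketch of a foveated feature map built from a pretrained
# ConvNet's in-network pyramid. All design choices here are assumptions.
import torch
import torch.nn.functional as F
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor

# Extract a two-level pyramid from a pretrained backbone (fine + coarse).
backbone = create_feature_extractor(
    resnet50(weights="IMAGENET1K_V1"),
    return_nodes={"layer2": "fine", "layer4": "coarse"},
).eval()

def ffm(image: torch.Tensor, fixation_xy: tuple, fovea_sigma: float = 64.0):
    """image: (1, 3, H, W); fixation_xy: (x, y) fixation location in pixels."""
    H, W = image.shape[-2:]
    feats = backbone(image)
    fine = feats["fine"]
    # Upsample the coarse level to the fine level's spatial size.
    coarse = F.interpolate(feats["coarse"], size=fine.shape[-2:],
                           mode="bilinear", align_corners=False)
    coarse = coarse[:, : fine.shape[1]]  # crude channel match for the sketch
    # Gaussian acuity map centred on the fixation, resampled to feature size.
    ys = torch.arange(H, dtype=torch.float32).view(-1, 1)
    xs = torch.arange(W, dtype=torch.float32).view(1, -1)
    acuity = torch.exp(-((xs - fixation_xy[0]) ** 2 + (ys - fixation_xy[1]) ** 2)
                       / (2 * fovea_sigma ** 2))
    acuity = F.interpolate(acuity[None, None], size=fine.shape[-2:],
                           mode="bilinear", align_corners=False)
    # Blend: fine features near the fovea, coarse features in the periphery.
    return acuity * fine + (1 - acuity) * coarse

# Example: state representation for a fixation at the image centre.
state = ffm(torch.rand(1, 3, 512, 512), fixation_xy=(256, 256))
```

In an inverse reinforcement learning setup, a state like this would be recomputed after each fixation and fed to the policy that proposes the next fixation or a search-termination action.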