Predicting human gaze is important in Human-Computer Interaction (HCI). However, to serve HCI applications in practice, gaze prediction models must be scalable, fast, and spatially and temporally accurate. Recent scanpath prediction models focus on goal-directed attention (search), but they are limited in practice because they commonly rely on target detectors trained for every possible search object and on human gaze data for training, neither of which scales. In response, we pose a new task called ZeroGaze, a new variant of zero-shot learning in which gaze is predicted for never-before-searched objects, and we develop a novel model, Gazeformer, to solve the ZeroGaze problem. In contrast to existing methods that use object detector modules, Gazeformer encodes the target using a natural language model, thus leveraging semantic similarities in scanpath prediction. We use a transformer-based encoder-decoder architecture because transformers are particularly useful for generating contextual representations. Gazeformer surpasses other models by a large margin in the ZeroGaze setting. It also outperforms existing target-detection models on standard gaze prediction for both target-present and target-absent search tasks. In addition to its improved performance, Gazeformer is more than five times faster than the state-of-the-art target-present visual search model.
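To make the architecture described above concrete, the following is a minimal PyTorch sketch, not the paper's implementation: a language-model embedding of the target name conditions a transformer encoder-decoder over image tokens, and the decoder outputs a fixation sequence. The specific choices here (a frozen ResNet-50 backbone, a Hugging Face roberta-base text encoder, a fixed maximum scanpath length, and a direct (x, y, duration) regression head) are illustrative assumptions that go beyond what the abstract states.

```python
# Illustrative sketch only: component choices are assumptions, not the paper's exact design.
import torch
import torch.nn as nn
import torchvision.models as models
from transformers import RobertaModel, RobertaTokenizer


class ScanpathSketch(nn.Module):
    def __init__(self, d_model=256, max_fixations=7):
        super().__init__()
        # Visual backbone (assumed): frozen ResNet-50 up to the last conv block,
        # giving a 7x7 grid of 2048-d features for a 224x224 input image.
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2]).eval()
        for p in self.cnn.parameters():
            p.requires_grad = False
        self.img_proj = nn.Linear(2048, d_model)

        # Target encoder: a pre-trained language model, so semantically similar
        # target names map to nearby embeddings (the basis for zero-shot transfer).
        self.tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
        self.lm = RobertaModel.from_pretrained("roberta-base").eval()
        for p in self.lm.parameters():
            p.requires_grad = False
        self.txt_proj = nn.Linear(self.lm.config.hidden_size, d_model)

        # Transformer encoder-decoder over image tokens plus the target token.
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8, num_encoder_layers=3,
            num_decoder_layers=3, batch_first=True)
        self.queries = nn.Parameter(torch.randn(max_fixations, d_model))
        self.fix_head = nn.Linear(d_model, 3)  # predicts (x, y, duration) per fixation

    def forward(self, images, target_names):
        b = images.size(0)
        # Image tokens: (B, 49, d_model)
        feats = self.cnn(images).flatten(2).transpose(1, 2)
        img_tokens = self.img_proj(feats)
        # Target token from the language model's first (<s>) output: (B, 1, d_model)
        toks = self.tokenizer(target_names, return_tensors="pt", padding=True)
        txt = self.lm(**toks).last_hidden_state[:, :1, :]
        tgt_token = self.txt_proj(txt)
        # Encode image + target jointly, then decode a fixed-length scanpath
        # from learned fixation queries.
        memory_in = torch.cat([img_tokens, tgt_token], dim=1)
        queries = self.queries.unsqueeze(0).expand(b, -1, -1)
        decoded = self.transformer(memory_in, queries)
        return self.fix_head(decoded)  # (B, max_fixations, 3)


if __name__ == "__main__":
    model = ScanpathSketch()
    imgs = torch.randn(2, 3, 224, 224)
    scanpaths = model(imgs, ["microwave", "stop sign"])
    print(scanpaths.shape)  # torch.Size([2, 7, 3])
```

Because the target enters only as a language-model embedding, a never-before-searched object name can be handled at inference time without a dedicated detector, which is what the ZeroGaze setting requires.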