One of the most fundamental and information-laden actions humans perform is looking at objects. However, a survey of current works reveals that existing gaze-related datasets annotate only the pixel being looked at, and not the boundaries of a specific object of interest. This lack of object annotation presents an opportunity for further advancing gaze estimation research. To this end, we present a challenging new task called gaze object prediction, where the goal is to predict a bounding box for a person's gazed-at object. To train and evaluate gaze networks on this task, we present the Gaze On Objects (GOO) dataset. GOO is composed of a large set of synthetic images (GOO-Synth) supplemented by a smaller subset of real images (GOO-Real) of people looking at objects in a retail environment. Our work establishes extensive baselines on GOO by re-implementing and evaluating selected state-of-the-art models on the tasks of gaze following and domain adaptation. Code is available on GitHub.
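To make the task's output format concrete, here is a minimal sketch of how a gaze object prediction could be derived from a predicted gaze point and a set of candidate object boxes. This is purely illustrative and not the authors' model: the class names, the `predict_gaze_object` function, and the nearest-center fallback heuristic are all assumptions introduced for this example.

```python
from dataclasses import dataclass

@dataclass
class Box:
    """Axis-aligned bounding box in pixel coordinates (illustrative type)."""
    x1: float
    y1: float
    x2: float
    y2: float

    def center(self) -> tuple[float, float]:
        return ((self.x1 + self.x2) / 2, (self.y1 + self.y2) / 2)

    def contains(self, x: float, y: float) -> bool:
        return self.x1 <= x <= self.x2 and self.y1 <= y <= self.y2

def predict_gaze_object(gaze_point: tuple[float, float],
                        candidates: list[Box]) -> Box:
    """Pick the candidate box for the gazed-at object.

    Hypothetical heuristic, not the paper's method: prefer a box that
    contains the predicted gaze point; otherwise fall back to the box
    whose center is nearest to it.
    """
    gx, gy = gaze_point
    containing = [b for b in candidates if b.contains(gx, gy)]
    pool = containing or candidates
    return min(pool,
               key=lambda b: (b.center()[0] - gx) ** 2 + (b.center()[1] - gy) ** 2)
```

A learned gaze object predictor would replace this heuristic, but the input/output contract is the same: an image-space gaze estimate plus candidate objects in, a single gazed-at bounding box out.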