This paper addresses the gaze target detection problem in single images captured from the third-person perspective. We present a multimodal deep architecture to infer where a person in a scene is looking. This spatial model is trained on head images of the person of interest, the scene, and depth maps representing rich context information. Unlike several prior works, our model does not require supervision of the gaze angles, does not rely on head orientation information, and does not need the location of the eyes of the person of interest. Extensive experiments demonstrate the stronger performance of our method on multiple benchmark datasets. We also investigate several variations of our method obtained by altering the joint learning of the multimodal data, some of which also outperform prior works. For the first time, we examine domain adaptation for gaze target detection, and we empower our multimodal network to effectively handle the domain gap across datasets. The code of the proposed method is available at https://github.com/francescotonini/multimodal-across-domains-gaze-target-detection.
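The abstract describes a network that fuses a head crop of the person of interest, the scene image, and a depth map to predict the gaze target. The sketch below illustrates one possible way to arrange such a three-branch fusion model in PyTorch; the encoder layers, feature sizes, and fusion-by-concatenation choice are illustrative assumptions, not the authors' actual architecture (which is defined in the linked repository).

```python
import torch
import torch.nn as nn

# Hypothetical sketch of a three-branch multimodal gaze target network.
# Branch designs, channel sizes, and the fusion strategy are assumptions
# for illustration only; see the linked repository for the real model.
class MultimodalGazeNet(nn.Module):
    def __init__(self, feat_dim=256, out_size=64):
        super().__init__()
        # Scene branch: encodes the full RGB image.
        self.scene_enc = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=7, stride=4, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8),
        )
        # Depth branch: encodes the single-channel depth map.
        self.depth_enc = nn.Sequential(
            nn.Conv2d(1, feat_dim, kernel_size=7, stride=4, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8),
        )
        # Head branch: encodes the crop of the person-of-interest's head.
        self.head_enc = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=7, stride=4, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8),
        )
        # Fusion + decoder: concatenated features are decoded into a gaze heatmap.
        self.decoder = nn.Sequential(
            nn.Conv2d(3 * feat_dim, feat_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, 1, kernel_size=1),
            nn.Upsample(size=(out_size, out_size), mode="bilinear", align_corners=False),
        )

    def forward(self, scene, depth, head):
        fused = torch.cat(
            [self.scene_enc(scene), self.depth_enc(depth), self.head_enc(head)], dim=1
        )
        return self.decoder(fused)  # (B, 1, out_size, out_size) gaze-target heatmap


# Usage example: a batch of 2 scenes, depth maps, and head crops -> 64x64 heatmaps.
model = MultimodalGazeNet()
heatmap = model(
    torch.randn(2, 3, 224, 224),  # scene images
    torch.randn(2, 1, 224, 224),  # depth maps
    torch.randn(2, 3, 224, 224),  # head crops
)
print(heatmap.shape)  # torch.Size([2, 1, 64, 64])
```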