A major challenge for physically unconstrained gaze estimation is acquiring training data with 3D gaze annotations for in-the-wild and outdoor scenarios. In contrast, videos of human interactions in unconstrained environments are abundantly available and can be much more easily annotated with frame-level activity labels. In this work, we tackle the previously unexplored problem of weakly-supervised gaze estimation from videos of human interactions. We leverage the insight that strong gaze-related geometric constraints exist when people perform the activity of "looking at each other" (LAEO). To acquire viable 3D gaze supervision from LAEO labels, we propose a training algorithm along with several novel loss functions especially designed for the task. With weak supervision from the two large-scale CMU-Panoptic and AVA-LAEO activity datasets, we show significant improvements in (a) the accuracy of semi-supervised gaze estimation and (b) cross-domain generalization on the state-of-the-art physically unconstrained in-the-wild Gaze360 gaze estimation benchmark. We open-source our code at https://github.com/NVlabs/weakly-supervised-gaze.
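To make the LAEO geometric constraint concrete, below is a minimal sketch of one plausible consistency loss, not the paper's exact formulation: when two people look at each other, each person's predicted 3D gaze direction should align with the direction toward the other person's head. The function name, tensor shapes, and use of cosine distance are illustrative assumptions; the actual losses are defined in the open-sourced code.

```python
import torch
import torch.nn.functional as F

def laeo_consistency_loss(gaze_a, gaze_b, head_a, head_b):
    """Hypothetical LAEO geometric-consistency loss (illustrative sketch).

    gaze_a, gaze_b: (N, 3) predicted 3D gaze directions for persons A and B.
    head_a, head_b: (N, 3) 3D head-center positions for persons A and B.
    """
    # Unit direction from A's head to B's head; B's target is the reverse.
    dir_ab = F.normalize(head_b - head_a, dim=-1)
    dir_ba = -dir_ab
    # Cosine distance (1 - cosine similarity) penalizes angular deviation
    # of each predicted gaze from the inter-head line of sight.
    loss_a = 1.0 - (F.normalize(gaze_a, dim=-1) * dir_ab).sum(dim=-1)
    loss_b = 1.0 - (F.normalize(gaze_b, dim=-1) * dir_ba).sum(dim=-1)
    return (loss_a + loss_b).mean()
```

Under this sketch, a frame-level LAEO activity label is enough to generate a 3D supervision signal: no per-frame gaze annotation is needed, only detected head positions for the interacting pair.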