Developing gaze estimation models that generalize to unseen domains and in-the-wild conditions remains an open challenge, largely because it is difficult to acquire ground-truth data that cover the real-world distribution of faces, head poses, and environmental conditions. In this work, we propose to train general gaze estimation models on 3D geometry-aware gaze pseudo-annotations, which we extract from arbitrary unlabelled face images that are abundantly available on the internet. Additionally, we leverage the observation that head, body, and hand pose estimation benefit from being recast as dense 3D coordinate prediction, and similarly express gaze estimation as regression of dense 3D eye meshes. We overcome the absence of compatible ground truth by fitting rigid 3D eyeballs to existing gaze datasets, and we design a multi-view supervision framework to balance the effect of pseudo-labels during training. We evaluate our method on the task of gaze generalization, demonstrating improvements of up to $30\%$ over the state of the art when no ground-truth data are available, and up to $10\%$ when they are. The project material will become available for research purposes.
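The abstract's formulation of gaze as regression of a dense 3D eye mesh implies that the final gaze direction can be recovered geometrically from the fitted rigid eyeball, rather than predicted directly as angles. A minimal sketch of that recovery step, assuming the mesh yields a 3D eyeball center and iris center (the function name and point convention are illustrative, not the paper's actual interface):

```python
import math

def gaze_from_eyeball(center, iris):
    """Hypothetical helper: derive a unit gaze vector from a fitted
    rigid 3D eyeball, as the direction from the eyeball center to the
    iris center (both 3D points taken from the dense eye mesh)."""
    d = [i - c for i, c in zip(iris, center)]           # center -> iris offset
    n = math.sqrt(sum(x * x for x in d))                # Euclidean norm
    return [x / n for x in d]                           # normalize to unit length

# Eyeball centered at the origin, iris displaced along -z (toward the camera),
# yielding a gaze vector pointing straight at the camera.
g = gaze_from_eyeball((0.0, 0.0, 0.0), (0.0, 0.0, -0.012))
# → [0.0, 0.0, -1.0]
```

Because the eyeball is treated as rigid, this mapping from mesh to gaze vector is deterministic, which is what allows 3D-consistent pseudo-annotations to supervise gaze without explicit angle labels.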