3D gaze estimation is most often tackled as learning a direct mapping between input images and the gaze vector or its spherical coordinates. Recently, it has been shown that pose estimation of the face, body, and hands benefits from revising the learning target from a few pose parameters to dense 3D coordinates. In this work, we leverage this observation and propose to tackle 3D gaze estimation as regression of 3D eye meshes. We overcome the absence of compatible ground truth by fitting a rigid 3D eyeball template to existing gaze datasets, and we propose to improve generalization by making use of widely available in-the-wild face images. To this end, we introduce an automatic pipeline that retrieves robust gaze pseudo-labels from arbitrary face images, and we design a multi-view supervision framework to balance their effect during training. In our experiments, our method achieves a 30% improvement over the state of the art in cross-dataset gaze estimation when no ground truth data are available for training, and a 7% improvement when they are. We make our project publicly available at https://github.com/Vagver/dense3Deyes.