Automatic eye gaze estimation is an important problem in vision-based assistive technology, with use cases in emerging areas such as augmented reality, virtual reality, and human-computer interaction. Over the past few years, there has been increasing interest in unsupervised and self-supervised learning paradigms, as they remove the requirement of large-scale annotated data. In this paper, we propose RAZE, a Region guided self-supervised gAZE representation learning framework that leverages non-annotated facial image data. RAZE learns gaze representations via auxiliary supervision, namely pseudo-gaze zone classification, where the objective is to classify the visual field into different gaze zones (i.e. left, right and center) by exploiting the relative position of the pupil centers. To this end, we automatically annotate 154K web-crawled images with pseudo gaze-zone labels and learn feature representations via the `Ize-Net' framework. `Ize-Net' is a capsule-layer-based CNN architecture that efficiently captures rich eye representations. The discriminative behaviour of the learnt representation is evaluated on four benchmark datasets: CAVE, TabletGaze, MPII and RT-GENE. Additionally, we evaluate the generalizability of the proposed network on two further downstream tasks (i.e. driver gaze estimation and visual attention estimation), which demonstrates the effectiveness of the learnt eye gaze representation.
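To make the pseudo-labelling idea concrete, the following is a minimal sketch of how a coarse gaze zone could be derived from relative pupil-center positions. It assumes eye-corner and pupil-center landmarks are already available from any off-the-shelf facial landmark detector; the function names, the averaging over both eyes, and the thresholds are illustrative assumptions, not the paper's exact procedure or values.

```python
# Hypothetical sketch of pseudo gaze-zone labelling from pupil-center positions.
# Landmarks are (x, y) pixel coordinates; corners are ordered left-to-right
# in the image. Thresholds below are assumed, not taken from the paper.
import numpy as np

def pupil_offset(corner_l, corner_r, pupil):
    """Normalized horizontal pupil position within the eye region:
    0.0 at the image-left corner, 1.0 at the image-right corner."""
    width = corner_r[0] - corner_l[0]
    return (pupil[0] - corner_l[0]) / (width + 1e-8)

def pseudo_gaze_zone(left_eye, right_eye, lo=0.40, hi=0.60):
    """Assign a coarse pseudo gaze-zone label ('left', 'right', 'center')
    by averaging the normalized pupil offsets of both eyes and
    thresholding the result into three zones."""
    offsets = [
        pupil_offset(eye["corner_l"], eye["corner_r"], eye["pupil"])
        for eye in (left_eye, right_eye)
    ]
    mean_offset = float(np.mean(offsets))
    if mean_offset < lo:
        return "left"
    if mean_offset > hi:
        return "right"
    return "center"

# Example: pupil centers sit slightly right of the eye midlines.
left_eye = {"corner_l": (100, 120), "corner_r": (140, 120), "pupil": (126, 121)}
right_eye = {"corner_l": (170, 120), "corner_r": (210, 120), "pupil": (196, 121)}
print(pseudo_gaze_zone(left_eye, right_eye))  # -> "right"
```

Labels produced this way are noisy but free, which is what allows the approach to scale to 154K non-annotated web-crawled images.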