Video-based gaze estimation methods aim to capture the inherent temporal dynamics of human eye gaze from multiple image frames. However, since models must capture both spatial and temporal relationships, performance is limited not only by the feature representations within a frame but also by those across frames. We propose the Spatio-Temporal Gaze Network (ST-Gaze), a model that combines a CNN backbone with dedicated channel attention and self-attention modules to optimally fuse eye and face features. The fused features are treated as a spatial sequence, allowing intra-frame context to be captured and then propagated through time to model inter-frame dynamics. We evaluate our method on the EVE dataset and show that ST-Gaze achieves state-of-the-art performance both with and without person-specific adaptation. Additionally, our ablation study provides further insight into model performance, showing that preserving and modelling intra-frame spatial context with our spatio-temporal recurrence is fundamentally superior to premature spatial pooling. As such, our results pave the way towards more robust video-based gaze estimation using commonly available cameras.
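To make the described pipeline concrete, the following is a minimal sketch of the idea, assuming a PyTorch implementation. The tiny backbone, layer sizes, SE-style channel attention, and the use of a GRU for the spatio-temporal recurrence are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention (assumed variant)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))        # global average pool -> (B, C)
        return x * w[:, :, None, None]         # re-weight channels


class STGazeSketch(nn.Module):
    """Illustrative spatio-temporal gaze model: fuse eye/face tokens per frame,
    then carry the recurrent hidden state across frames."""
    def __init__(self, feat_dim: int = 128, hidden_dim: int = 128):
        super().__init__()
        # Shared CNN backbone for eye and face crops (deliberately small here).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, feat_dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.channel_attn = ChannelAttention(feat_dim)
        # Self-attention fuses eye and face tokens within a frame.
        self.self_attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        # Recurrence over the spatial token sequence; hidden state links frames.
        self.recurrence = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 2)   # yaw / pitch gaze angles

    def encode(self, img):                     # img: (B, 3, H, W)
        f = self.channel_attn(self.backbone(img))
        return f.flatten(2).transpose(1, 2)    # spatial tokens: (B, H'*W', C)

    def forward(self, eyes, faces):            # each: (B, T, 3, H, W)
        B, T = eyes.shape[:2]
        hidden, gazes = None, []
        for t in range(T):
            tokens = torch.cat([self.encode(eyes[:, t]), self.encode(faces[:, t])], dim=1)
            fused, _ = self.self_attn(tokens, tokens, tokens)   # intra-frame context
            _, hidden = self.recurrence(fused, hidden)          # propagate through time
            gazes.append(self.head(hidden[-1]))
        return torch.stack(gazes, dim=1)       # (B, T, 2)


# Usage: random clips of 4 frames with 64x64 eye and face crops.
model = STGazeSketch()
out = model(torch.randn(2, 4, 3, 64, 64), torch.randn(2, 4, 3, 64, 64))
print(out.shape)  # torch.Size([2, 4, 2])
```

The key design point this sketch mirrors is that spatial tokens are kept as a sequence rather than pooled before the recurrence, so intra-frame context survives into the temporal model.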