The study of human gaze behavior in natural contexts requires gaze estimation algorithms that are robust to a wide range of imaging conditions. However, algorithms often fail to identify features such as the iris and pupil centroid in the presence of reflective artifacts and occlusions. Previous work has shown that convolutional networks excel at extracting gaze features despite such artifacts. However, these networks often perform poorly on data unseen during training. This work follows the intuition that a convolutional network jointly trained on multiple datasets learns a generalized representation of eye parts. We compare the performance of a single model trained with multiple datasets against that of a pool of models trained on individual datasets. Results indicate that models tested on datasets in which eye images exhibit higher appearance variability benefit from multiset training. In contrast, dataset-specific models generalize better to eye images with lower appearance variability.
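The multiset setup described above boils down to pooling samples from several corpora into one training stream while keeping each sample tagged with its source, so per-dataset evaluation remains possible. A minimal sketch of that pooling step, with hypothetical dataset names and toy samples standing in for eye images (none of these identifiers come from the paper):

```python
import random

def make_joint_pool(datasets, seed=0):
    """Merge several datasets into one shuffled training pool.

    `datasets` maps a dataset name to a list of samples. Each pooled
    entry keeps its source tag, so a jointly trained model can still
    be evaluated per dataset afterwards. Purely illustrative.
    """
    pool = [(name, sample)
            for name, samples in datasets.items()
            for sample in samples]
    random.Random(seed).shuffle(pool)  # deterministic shuffle for the sketch
    return pool

# Toy stand-ins for eye-image corpora (hypothetical names).
datasets = {
    "set_a": ["a0", "a1", "a2"],
    "set_b": ["b0", "b1"],
}

joint_pool = make_joint_pool(datasets)          # multiset training stream
per_set = {k: v for k, v in datasets.items()}   # dataset-specific alternative
```

In a real pipeline the shuffled pool would feed mini-batches to a single network, whereas the dataset-specific baseline trains one model per entry of `per_set`.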