The World Wide Web, with websites and webpages as its main interface, facilitates the dissemination of important information. It is therefore crucial to optimize them for better user interaction, which is primarily done by analyzing users' behavior, especially users' eye-gaze locations. However, gathering such data remains labor- and time-intensive. In this work, we enable automatic eye-gaze estimation given a website screenshot as input. We do this by curating a unified dataset consisting of website screenshots, eye-gaze heatmaps, and website layout information in the form of image and text masks. Our pre-processed dataset allows us to propose an effective deep-learning-based model that leverages both image and text spatial locations, combined through an attention mechanism, for accurate eye-gaze prediction. In our experiments, we show the benefit of careful fine-tuning on our unified dataset for improving the accuracy of eye-gaze predictions. We further observe the capability of our model to focus on the targeted areas (images and text) to achieve high accuracy. Finally, comparison with alternative approaches shows that our model achieves state-of-the-art results, establishing a benchmark for the eye-gaze prediction task.