This paper addresses the new problem of understanding human gaze communication in social videos at both the atomic level and the event level, which is significant for studying human social interactions. To tackle this novel and challenging problem, we contribute a large-scale video dataset, VACATION, which covers diverse daily social scenes and gaze communication behaviors, with complete annotations of objects and human faces, human attention, and communication structures and labels at both the atomic level and the event level. Together with VACATION, we propose a spatio-temporal graph neural network that explicitly represents the diverse gaze interactions in social scenes and infers atomic-level gaze communication by message passing. We further propose an event network with an encoder-decoder structure to predict event-level gaze communication. Our experiments demonstrate that the proposed model significantly improves over various baselines in predicting atomic-level and event-level gaze communication.
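To make the message-passing idea concrete, the following is a minimal sketch of one propagation round on a toy gaze graph. It is an illustration under stated assumptions, not the model proposed here: the node names, the two-dimensional features, the mean aggregation, and the `self_weight` parameter are all hypothetical, and the actual network learns its message and update functions.

```python
# Illustrative sketch only: one round of message passing on a tiny gaze graph.
# Node names, feature sizes, and the mean-aggregation update are assumptions,
# not the architecture described in the paper.

def message_passing_step(features, edges, self_weight=0.5):
    """Return updated node features after one round of mean aggregation.

    features: dict mapping node name -> list of floats
    edges: list of (source, target) pairs; a message flows source -> target
    self_weight: share of a node's own feature that is kept (hypothetical knob)
    """
    # Collect the features arriving at each node along its incoming edges.
    incoming = {node: [] for node in features}
    for source, target in edges:
        incoming[target].append(features[source])

    updated = {}
    for node, own in features.items():
        messages = incoming[node]
        if not messages:
            # A node with no incoming edges keeps its feature unchanged.
            updated[node] = list(own)
            continue
        dim = len(own)
        # Average the incoming messages, then blend with the node's own feature.
        mean = [sum(m[i] for m in messages) / len(messages) for i in range(dim)]
        updated[node] = [
            self_weight * own[i] + (1.0 - self_weight) * mean[i]
            for i in range(dim)
        ]
    return updated


# Toy scene: two people looking at each other, one of them also looking at a cup.
features = {
    "person_a": [1.0, 0.0],
    "person_b": [0.0, 1.0],
    "cup": [0.0, 0.0],
}
edges = [("person_a", "person_b"), ("person_b", "person_a"), ("person_a", "cup")]
updated = message_passing_step(features, edges)
```

After one round, each node's feature mixes in information from the nodes whose gaze edges point to it. Repeating the step would spread information further across the graph.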