Exploring what humans pay attention to in dynamic panoramic scenes is useful for many fundamental applications, including augmented reality (AR) in retail, AR-powered recruitment, and visual language navigation. With this goal in mind, we propose PV-SOD, a new task that aims to segment salient objects from panoramic videos. In contrast to existing fixation-/object-level saliency detection tasks, we focus on audio-induced salient object detection (SOD), where the salient objects are labeled under the guidance of audio-induced eye movements. To support this task, we collect the first large-scale dataset, named ASOD60K, which contains 4K-resolution video frames annotated with a six-level hierarchy, thus distinguishing itself by its richness, diversity, and quality. Specifically, each sequence is labeled with both its super- and sub-class, and the objects of each sub-class are further annotated with human eye fixations, bounding boxes, object-/instance-level masks, and associated attributes (e.g., geometrical distortion). These coarse-to-fine annotations enable detailed analysis for PV-SOD modeling, e.g., determining the major challenges for existing SOD models, and predicting scanpaths to study the long-term eye-fixation behavior of humans. We systematically benchmark 11 representative approaches on ASOD60K and derive several interesting findings. We hope this study can serve as a good starting point for advancing SOD research towards panoramic videos. The dataset and benchmark will be made publicly available at https://github.com/PanoAsh/ASOD60K.
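To make the coarse-to-fine annotation hierarchy concrete, the minimal Python sketch below models one per-frame annotation record covering the six levels named above (super-class, sub-class, fixations, bounding boxes, object-/instance-level masks, and attributes). The class and field names are our own illustrative assumptions, not the released ASOD60K schema.

```python
# Hypothetical sketch of one per-frame ASOD60K annotation record.
# Field names and structure are assumptions for illustration only.
from dataclasses import dataclass, field
from typing import List, Tuple

import numpy as np


@dataclass
class InstanceAnnotation:
    """One salient object instance within a frame."""
    bbox: Tuple[int, int, int, int]  # (x, y, w, h) bounding box in pixels
    instance_mask: np.ndarray        # binary instance-level mask, shape (H, W)
    attributes: List[str] = field(default_factory=list)  # e.g., ["geometrical distortion"]


@dataclass
class FrameAnnotation:
    """Coarse-to-fine annotations for a single 4K panoramic frame."""
    super_class: str                       # coarse sequence category
    sub_class: str                         # finer-grained category within the super-class
    fixations: List[Tuple[int, int]]       # audio-induced eye-fixation points (x, y)
    object_mask: np.ndarray                # object-level saliency mask, shape (H, W)
    instances: List[InstanceAnnotation] = field(default_factory=list)
```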