We propose a method to detect individualized highlights for users in given target videos based on the preferred highlight clips they have marked in previously watched videos. Our method explicitly leverages the contents of both the preferred clips and the target videos using pre-trained features for objects and human activities. We design a multi-head attention mechanism that adaptively weights the preferred clips based on their object- and human-activity-based contents, and fuses them using these weights into a single feature representation per user. We compute similarities between these per-user feature representations and the per-frame features of the target videos to estimate the user-specific highlight clips. We test our method on a large-scale highlight detection dataset containing the annotated highlights of individual users. Compared to current baselines, we observe an absolute improvement of 2-4% in the mean average precision of the detected highlights. We also perform extensive ablation experiments on the number of preferred highlight clips associated with each user, as well as on the object- and human-activity-based feature representations, to validate that our method is indeed both content-based and user-specific.
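A minimal sketch of the fusion-and-scoring pipeline described above, assuming pre-extracted clip and frame features of a single shared dimension (rather than the paper's separate object- and human-activity-based streams); the module, parameter names, and the cosine-similarity scoring are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UserHighlightScorer(nn.Module):
    """Hypothetical sketch: fuse a user's preferred clips via multi-head
    attention, then score target-video frames by similarity."""

    def __init__(self, feat_dim=512, num_heads=8):
        super().__init__()
        # Learnable query that attends over the user's preferred clips (assumption).
        self.user_query = nn.Parameter(torch.randn(1, 1, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)

    def forward(self, clip_feats, frame_feats):
        # clip_feats:  (num_clips, feat_dim)  features of the user's preferred clips
        # frame_feats: (num_frames, feat_dim) per-frame features of the target video
        clips = clip_feats.unsqueeze(0)  # (1, num_clips, feat_dim)
        # Multi-head attention adaptively weights the preferred clips and
        # fuses them into a single per-user feature representation.
        user_repr, attn_weights = self.attn(self.user_query, clips, clips)
        user_repr = user_repr.squeeze(0)  # (1, feat_dim)
        # Similarity between the per-user representation and each frame
        # serves as the per-frame highlight score (cosine similarity assumed).
        scores = F.cosine_similarity(user_repr, frame_feats, dim=-1)
        return scores, attn_weights

# Usage: frames with the highest scores would be the predicted user-specific highlights.
scorer = UserHighlightScorer()
clip_feats = torch.randn(5, 512)     # e.g., 5 previously marked highlight clips
frame_feats = torch.randn(300, 512)  # e.g., 300 frames of a target video
scores, _ = scorer(clip_feats, frame_feats)
```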