We present a domain- and user-preference-agnostic approach to detect highlightable excerpts from human-centric videos. Our method works on the graph-based representation of multiple observable human-centric modalities in the videos, such as poses and faces. We use an autoencoder network equipped with spatial-temporal graph convolutions to detect human activities and interactions based on these modalities. We train our network to map the activity- and interaction-based latent structural representations of the different modalities to per-frame highlight scores based on the representativeness of the frames. We use these scores to compute which frames to highlight and stitch contiguous frames to produce the excerpts. We train our network on the large-scale AVA-Kinetics action dataset and evaluate it on four benchmark video highlight datasets: DSH, TVSum, PHD2, and SumMe. We observe a 4-12% improvement in the mean average precision of matching the human-annotated highlights over state-of-the-art methods in these datasets, without requiring any user-provided preferences or dataset-specific fine-tuning.
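To make the excerpt-construction step concrete, the following is a minimal sketch of how per-frame highlight scores could be thresholded and contiguous high-scoring frames stitched into excerpts. The function name, the score threshold, and the minimum run length are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def stitch_excerpts(highlight_scores, threshold=0.5, min_length=15):
    """Group contiguous frames whose highlight score exceeds a threshold
    into (start_frame, end_frame) excerpts, dropping very short runs.

    highlight_scores: 1-D array of per-frame scores in [0, 1].
    threshold, min_length: hypothetical values for illustration only.
    """
    scores = np.asarray(highlight_scores)
    keep = scores >= threshold           # frames selected for highlighting
    excerpts, start = [], None
    for i, flag in enumerate(keep):
        if flag and start is None:
            start = i                    # a new contiguous run begins
        elif not flag and start is not None:
            if i - start >= min_length:  # keep only sufficiently long runs
                excerpts.append((start, i - 1))
            start = None
    if start is not None and len(keep) - start >= min_length:
        excerpts.append((start, len(keep) - 1))
    return excerpts

# Example: a toy score sequence with two high-scoring runs.
scores = np.concatenate([np.full(20, 0.2), np.full(30, 0.9),
                         np.full(10, 0.1), np.full(25, 0.8)])
print(stitch_excerpts(scores))  # -> [(20, 49), (60, 84)]
```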