This paper investigates the challenge of extracting highlight moments from videos. To perform this task, a system needs to understand what constitutes a highlight for arbitrary video domains while at the same time being able to scale across different domains. Our key insight is that photographs taken by photographers tend to capture the most remarkable or photogenic moments of an activity. Drawing on this insight, we present Videogenic, a system capable of creating domain-specific highlight videos for a wide range of domains. In a human evaluation study (N=50), we show that a high-quality photograph collection combined with CLIP-based retrieval (which uses a neural network with semantic knowledge of images) can serve as an excellent prior for finding video highlights. In a within-subjects expert study (N=12), we demonstrate the usefulness of Videogenic in helping video editors create highlight videos with lighter workload, shorter task completion time, and better usability.
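As a rough illustration of the core idea, the sketch below (not the authors' implementation; the model checkpoint, frame sampling, and max-similarity aggregation are assumptions for illustration) scores each video frame by its CLIP-embedding similarity to a collection of high-quality domain photographs and keeps the best-matching frames as highlight candidates.

```python
# Minimal sketch of CLIP-based highlight retrieval: compare video frames
# against a reference photo collection in CLIP embedding space.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Checkpoint chosen for illustration only.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(images):
    """Return L2-normalized CLIP image embeddings for a list of PIL images."""
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def highlight_scores(frames, reference_photos):
    """Score each frame by its best cosine similarity to any reference photo."""
    frame_emb = embed(frames)             # (num_frames, dim)
    photo_emb = embed(reference_photos)   # (num_photos, dim)
    sims = frame_emb @ photo_emb.T        # pairwise cosine similarities
    return sims.max(dim=1).values         # best-matching photo per frame

# Usage (hypothetical paths): load sampled frames and domain photos as PIL
# images, then rank frames by highlight_scores(frames, photos) and keep the
# top-k as candidate highlight moments.
```

The design choice here is to treat the photo collection as a non-parametric prior: no highlight detector is trained per domain, so swapping in a new photo collection is all that is needed to cover a new activity.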