With vast amounts of video content uploaded to the Internet every minute, video summarization has become critical for efficient browsing, searching, and indexing of visual content. Moreover, the spread of social and egocentric cameras produces an abundance of sparse scenarios captured by several devices that ultimately need to be summarized jointly. In this paper, we address the problem of summarizing videos recorded independently by several dynamic cameras that share the field of view only intermittently. We present a robust framework that (a) identifies a diverse set of important events among moving cameras that often do not capture the same scene, and (b) selects the most representative view(s) at each event for inclusion in a universal summary. Due to the lack of an applicable alternative, we collected a new multi-view egocentric dataset, Multi-Ego. Our dataset is recorded simultaneously by three cameras and covers a wide variety of real-life scenarios. The footage is annotated by multiple individuals under various summarization configurations, with a consensus analysis ensuring reliable ground truth. We conduct extensive experiments on the compiled dataset, in addition to three other standard benchmarks, that demonstrate the robustness and the advantage of our approach in both supervised and unsupervised settings. Additionally, we show that our approach learns collectively from data with a varying number of views and is orthogonal to other summarization methods, making it scalable and generic.