The recent growth of web video sharing platforms has increased the demand for systems that can efficiently browse, retrieve and summarize video content. Query-aware multi-video summarization is a promising technique that caters to this demand. In this work, we introduce a novel Query-Aware Hierarchical Pointer Network for Multi-Video Summarization, termed DeepQAMVS, that jointly optimizes multiple criteria: (1) conciseness, (2) representativeness of important query-relevant events and (3) chronological soundness. We design a hierarchical attention model that factorizes over three distributions, each collecting evidence from a different modality, followed by a pointer network that selects frames to include in the summary. DeepQAMVS is trained with reinforcement learning, incorporating rewards that capture representativeness, diversity, query-adaptability and temporal coherence. We achieve state-of-the-art results on the MVS1K dataset, with inference time scaling linearly with the number of input video frames.
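The selection step described above can be illustrated with a minimal sketch. It is not the DeepQAMVS implementation: it assumes, purely for illustration, that each of the three modalities yields one score per candidate frame, that the three resulting distributions are combined by an elementwise product, and that frames are then picked greedily without repetition. The names `softmax`, `combine` and `select_frames` are hypothetical, and a trained pointer network would recompute the scores from its decoder state at every step instead of reusing fixed ones.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into a probability distribution."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine(distributions):
    """Multiply per-modality distributions elementwise and renormalize.

    Assumption: the factorized attention is approximated here by a plain
    product of three distributions over the same candidate frames.
    """
    joint = [1.0] * len(distributions[0])
    for dist in distributions:
        joint = [j * p for j, p in zip(joint, dist)]
    total = sum(joint)
    return [j / total for j in joint]


def select_frames(modality_scores, summary_length):
    """Greedy pointer-style selection of frame indices, without repeats."""
    joint = combine([softmax(scores) for scores in modality_scores])
    selected = []
    for _ in range(summary_length):
        candidates = [i for i in range(len(joint)) if i not in selected]
        if not candidates:
            break  # fewer frames than requested summary slots
        selected.append(max(candidates, key=lambda i: joint[i]))
    return selected


# Hypothetical scores for four candidate frames from three modalities.
example_scores = [
    [0.1, 2.0, 0.3, 1.0],
    [0.2, 1.5, 0.1, 1.2],
    [0.0, 1.0, 0.5, 2.5],
]
summary = select_frames(example_scores, summary_length=2)
```

With a fixed summary length, each selection step in this sketch makes a single pass over the candidate frames, which is one simple way a selector's cost can grow linearly with the number of input frames.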