VideoXum: Cross-modal Visual and Textural Summarization of Videos

Video summarization aims to distill the most important information from a source video into either an abridged clip or a textual narrative. Existing methods have typically been designed for a single output type, video or text, and therefore ignore the correlation between the two semantically related tasks of visual summarization and textual summarization. We propose a new joint video and text summarization task. The goal is to generate, from a long video, both a shortened video clip and the corresponding textual summary, collectively referred to as a cross-modal summary. The generated video clip and text narrative should be semantically well aligned. To this end, we first build a large-scale human-annotated dataset, VideoXum (X refers to different modalities). The dataset is reannotated based on ActivityNet. After filtering out videos that do not meet the length requirements, 14,001 long videos remain in our new dataset. Each video in the reannotated dataset has human-annotated video summaries and corresponding narrative summaries. We then design a novel end-to-end model, VTSUM-BLIP, to address the challenges of the proposed task. Moreover, we propose a new metric, VT-CLIPScore, to help evaluate the semantic consistency of cross-modal summaries. The proposed model achieves promising performance on this new task and establishes a benchmark for future research.

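The abstract does not define VT-CLIPScore, so the following is only a generic sketch of how a CLIP-style consistency score between a video summary and a text summary could be computed. It assumes the summary frames and the text summary have already been embedded into a shared space by some image-text encoder (for example, a CLIP-like model), and it averages the cosine similarity between each frame embedding and the text embedding. The function names `cosine` and `consistency_score` are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def consistency_score(frame_embeddings, text_embedding):
    """Mean cosine similarity between each summary-frame embedding and the text embedding.

    Assumption: this is an illustrative stand-in, not the paper's VT-CLIPScore definition.
    """
    if not frame_embeddings:
        raise ValueError("at least one frame embedding is required")
    sims = [cosine(frame, text_embedding) for frame in frame_embeddings]
    return sum(sims) / len(sims)


# Toy embeddings: one frame aligned with the text, one orthogonal to it.
frames = [[1.0, 0.0], [0.0, 1.0]]
text = [1.0, 0.0]
score = consistency_score(frames, text)
```

With real embeddings, a higher score would suggest that the selected frames and the text summary describe the same content, which is the kind of semantic alignment the abstract asks the two summaries to have.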
