Video summarization aims to distill the most important information from a source video to produce either an abridged clip or a textual narrative. Traditionally, separate methods have been proposed depending on whether the output is a video or text, thus ignoring the correlation between the two semantically related tasks of visual summarization and textual summarization. We propose a new joint video and text summarization task. The goal is to generate from a long video both a shortened video clip and the corresponding textual summary, collectively referred to as a cross-modal summary. The generated video clip and textual narrative should be semantically well aligned. To this end, we first build a large-scale human-annotated dataset -- VideoXum (X refers to different modalities). The dataset is reannotated based on ActivityNet. After we filter out the videos that do not meet the length requirements, 14,001 long videos remain in our new dataset. Each video in the reannotated dataset has human-annotated video summaries and the corresponding narrative summaries. We then design a novel end-to-end model -- VTSUM-BLIP -- to address the challenges of the proposed task. Moreover, we propose a new metric, VT-CLIPScore, to help evaluate the semantic consistency of cross-modal summaries. The proposed model achieves promising performance on this new task and establishes a benchmark for future research.
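The abstract does not detail how VT-CLIPScore is computed. The sketch below is a minimal illustration of one plausible CLIP-based cross-modal consistency measure: embed the frames of the generated video summary and the generated text summary with a pretrained CLIP model and take their cosine similarity. The checkpoint name, mean-pooling of frame embeddings, and the helper `vt_clip_score` are illustrative assumptions, not the paper's actual implementation, which may rely on a CLIP model fine-tuned on VideoXum.

```python
# Hypothetical sketch of a CLIP-based video-text consistency score, in the
# spirit of VT-CLIPScore. Checkpoint, pooling, and function name are assumptions.
from typing import List

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

_MODEL_NAME = "openai/clip-vit-base-patch32"  # assumed off-the-shelf checkpoint
_model = CLIPModel.from_pretrained(_MODEL_NAME)
_processor = CLIPProcessor.from_pretrained(_MODEL_NAME)


@torch.no_grad()
def vt_clip_score(summary_frames: List[Image.Image], summary_text: str) -> float:
    """Cosine similarity between pooled frame embeddings and the text embedding."""
    inputs = _processor(
        text=[summary_text],
        images=summary_frames,
        return_tensors="pt",
        padding=True,
        truncation=True,
    )
    image_emb = _model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = _model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )
    # Normalize per-frame embeddings, mean-pool them, then renormalize and
    # compare against the normalized text embedding via a dot product.
    image_emb = torch.nn.functional.normalize(image_emb, dim=-1).mean(dim=0)
    image_emb = torch.nn.functional.normalize(image_emb, dim=-1)
    text_emb = torch.nn.functional.normalize(text_emb, dim=-1).squeeze(0)
    return float(image_emb @ text_emb)
```

In such a setup, a higher score indicates that the shortened video clip and the narrative summary describe the same content, which is the semantic-alignment property the proposed metric is meant to capture.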