Video transcript summarization is a fundamental task for video understanding. Conventional approaches to transcript summarization are usually built on summarization data for written language, such as news articles, and this domain discrepancy can degrade model performance on spoken text. In this paper, we present VT-SSum, a spoken-language benchmark dataset for video transcript segmentation and summarization, which includes 125K transcript-summary pairs from 9,616 videos. VT-SSum takes advantage of videos from VideoLectures.NET, leveraging the slide content as weak supervision to generate extractive summaries for video transcripts. Experiments with a state-of-the-art deep learning approach show that a model trained on VT-SSum brings a significant improvement on the AMI spoken-text summarization benchmark. VT-SSum will be publicly available to support future research on video transcript segmentation and summarization.
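For illustration only, the sketch below shows one simple way slide content could serve as weak supervision for extractive labeling: transcript sentences that overlap strongly with the aligned slide's text are marked as summary sentences. The token-overlap score, the `overlap_threshold` value, and the function names are assumptions for this sketch, not the paper's actual alignment procedure.

```python
# Minimal sketch of slide-based weak supervision for extractive summarization.
# Assumption: a simple token-overlap score against the aligned slide text decides
# whether a transcript sentence is labeled as a summary sentence.

import re


def tokenize(text: str) -> set[str]:
    """Lowercase word tokens, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def label_summary_sentences(transcript_sentences: list[str],
                            slide_text: str,
                            overlap_threshold: float = 0.5) -> list[int]:
    """Return 0/1 labels: 1 if a sentence shares enough tokens with the slide."""
    slide_tokens = tokenize(slide_text)
    labels = []
    for sentence in transcript_sentences:
        tokens = tokenize(sentence)
        overlap = len(tokens & slide_tokens) / max(len(tokens), 1)
        labels.append(1 if overlap >= overlap_threshold else 0)
    return labels


if __name__ == "__main__":
    sentences = [
        "Today we will talk about transformers for summarization.",
        "Let me first take a sip of water.",
        "Transformers use self-attention to build contextual representations.",
    ]
    slide = "Transformers for summarization: self-attention, contextual representations"
    print(label_summary_sentences(sentences, slide))  # e.g. [1, 0, 1]
```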