As the volume of video content has mushroomed in recent years, automatic video summarization has become useful for quickly previewing what a video contains. However, the generic video summarization task has two underlying limitations. First, most previous approaches take only visual features as input, leaving the other modalities unused. Second, existing datasets for generic video summarization are too small to train a caption generator for extracting text information from a video, or to train multimodal feature extractors. To address these two problems, this paper proposes the Multimodal Frame-Scoring Transformer (MFST), a framework that exploits visual, text, and audio features and scores a video at the frame level. The MFST framework first extracts the features of each modality (audio, visual, and text) using pretrained encoders. It then trains a multimodal frame-scoring transformer that takes the multimodal representation built from the extracted features as input and predicts frame-level importance scores. Extensive experiments against previous models, together with ablation studies on the TVSum and SumMe datasets, demonstrate that our method outperforms prior work by a large margin in both F1 score and rank-based evaluation.
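To make the pipeline concrete, below is a minimal PyTorch sketch of the frame-scoring idea the abstract describes. It is not the authors' implementation: the feature dimensions, the concatenate-then-project fusion, and the linear score head are illustrative assumptions, and the pretrained per-modality encoders are represented here by stand-in random features.

```python
# Minimal sketch of a multimodal frame-scoring transformer, assuming
# per-frame features have already been extracted by pretrained encoders.
# Dimensions, fusion strategy, and score head are assumptions for
# illustration, not the paper's exact architecture.
import torch
import torch.nn as nn

class MultimodalFrameScorer(nn.Module):
    def __init__(self, d_visual=1024, d_text=768, d_audio=128,
                 d_model=512, n_heads=8, n_layers=4):
        super().__init__()
        # Fuse modalities by concatenation, then project to a shared width.
        self.proj = nn.Linear(d_visual + d_text + d_audio, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # One importance score per frame.
        self.score_head = nn.Linear(d_model, 1)

    def forward(self, visual, text, audio):
        # visual/text/audio: (batch, n_frames, d_modality), aligned per frame.
        x = torch.cat([visual, text, audio], dim=-1)
        h = self.encoder(self.proj(x))
        return self.score_head(h).squeeze(-1)  # (batch, n_frames) scores

# Usage with random stand-in features for a 120-frame clip:
model = MultimodalFrameScorer()
scores = model(torch.randn(1, 120, 1024),
               torch.randn(1, 120, 768),
               torch.randn(1, 120, 128))
print(scores.shape)  # torch.Size([1, 120])
```

In a summarization setting, the predicted per-frame scores would then be used to select the highest-scoring frames or shots, subject to a summary-length budget.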