As the amount of video content has mushroomed in recent years, automatic video summarization has become useful for quickly grasping what a video is about. However, the generic video summarization task has two underlying limitations. First, most previous approaches take only visual features as input, leaving the other modalities behind. Second, existing datasets for generic video summarization are too small to train a caption generator and multimodal feature extractors. To address these two problems, this paper proposes the Multimodal Frame-Scoring Transformer (MFST), a framework that exploits visual, text, and audio features and scores a video at the frame level. Our MFST framework first extracts the features of each modality (visual, text, audio) using pretrained encoders. MFST then trains the multimodal frame-scoring transformer, which takes the video-text-audio representations as input and predicts frame-level importance scores. Extensive experiments comparing against previous models, together with ablation studies on the TVSum and SumMe datasets, demonstrate the effectiveness and superiority of our proposed method.
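To make the described pipeline concrete, below is a minimal PyTorch-style sketch of a frame-scoring transformer operating on fused multimodal features. The module names, feature dimensions, additive fusion, and hyperparameters are illustrative assumptions for this sketch, not the paper's actual MFST implementation.

```python
# Minimal sketch (hypothetical): fuse per-frame visual/text/audio features
# from pretrained encoders, contextualize them with a transformer encoder,
# and predict one importance score per frame.
import torch
import torch.nn as nn

class FrameScoringTransformer(nn.Module):
    def __init__(self, d_vis=1024, d_txt=768, d_aud=128,
                 d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        # Project each modality's pretrained-encoder output to a shared space.
        self.proj_vis = nn.Linear(d_vis, d_model)
        self.proj_txt = nn.Linear(d_txt, d_model)
        self.proj_aud = nn.Linear(d_aud, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.score_head = nn.Linear(d_model, 1)  # frame-level importance score

    def forward(self, vis, txt, aud):
        # vis/txt/aud: (batch, n_frames, d_modality) frame-aligned features.
        # Additive fusion here is a simplification of the paper's fusion.
        x = self.proj_vis(vis) + self.proj_txt(txt) + self.proj_aud(aud)
        x = self.encoder(x)                    # contextualize across frames
        return self.score_head(x).squeeze(-1)  # (batch, n_frames) scores

# Usage with dummy features for a batch of 2 videos, 120 frames each.
model = FrameScoringTransformer()
scores = model(torch.randn(2, 120, 1024),
               torch.randn(2, 120, 768),
               torch.randn(2, 120, 128))
print(scores.shape)  # torch.Size([2, 120])
```

In practice, the predicted frame scores would be trained against ground-truth importance annotations (e.g., with a regression loss) and then used to select the highest-scoring frames or shots as the summary.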