Video summarization aims to produce a concise summary by capturing and combining the most informative parts of a video. Existing approaches regard the task as a frame-wise keyframe selection problem and generally construct frame-wise representations by combining long-range temporal dependencies with unimodal or bimodal information. However, an optimal video summary should reflect both the value of each keyframe in its own right and its semantic relevance to the content as a whole. It is therefore critical to construct a more powerful and robust frame-wise representation and to predict frame-level importance scores in a fair and comprehensive manner. To tackle these issues, we propose a multimodal hierarchical shot-aware convolutional network, denoted MHSCNet, which enhances the frame-wise representation by combining all available multimodal information. Specifically, we design a hierarchical ShotConv network that produces an adaptive shot-aware frame-level representation by modeling both short-range and long-range temporal dependencies. Based on the learned shot-aware representations, MHSCNet predicts frame-level importance scores from both the local and the global view of the video. Extensive experiments on two standard video summarization datasets demonstrate that our proposed method consistently outperforms state-of-the-art baselines. The source code will be made publicly available.