Video summarization aims to automatically generate a summary (storyboard or video skim) of a video, which can facilitate large-scale video retrieval and browsing. Most existing methods perform summarization on individual videos in isolation, neglecting the correlations among similar videos. Such correlations, however, are also informative for video understanding and video summarization. To address this limitation, we propose Video Joint Modelling based on Hierarchical Transformer (VJMHT) for co-summarization, which takes into account the semantic dependencies across videos. Specifically, VJMHT consists of two layers of Transformer: the first layer extracts semantic representations from the individual shots of similar videos, while the second layer performs shot-level video joint modelling to aggregate cross-video semantic information. In this way, complete cross-video high-level patterns are explicitly modelled and learned for the summarization of individual videos. Moreover, Transformer-based video representation reconstruction is introduced to maximize the high-level similarity between the summary and the original video. Extensive experiments verify the effectiveness of the proposed modules and the superiority of VJMHT in terms of F-measure and rank-based evaluation.
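To make the two-layer design concrete, the sketch below shows one plausible PyTorch realization of the hierarchy described above: a first Transformer encodes the frames of each shot into a shot embedding, and a second Transformer attends over the pooled shot embeddings of all similar videos before scoring shot importance. This is a minimal illustration under our own assumptions, not the authors' implementation; the class name `HierarchicalCoSummarizer`, the mean-pooling step, the feature dimensions, and the linear scoring head are all hypothetical.

```python
# Minimal sketch of a two-layer hierarchical Transformer for co-summarization.
# Not the VJMHT implementation: all names and dimensions here are assumptions.
import torch
import torch.nn as nn

class HierarchicalCoSummarizer(nn.Module):
    """First layer: frame-level encoding within each shot.
    Second layer: shot-level joint modelling across similar videos."""

    def __init__(self, d_model=512, n_heads=8, n_layers=2):
        super().__init__()
        # First layer: encodes the frames of one shot into frame features.
        self.shot_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            n_layers)
        # Second layer: attends over shot embeddings pooled from all
        # similar videos, aggregating cross-video semantic information.
        self.joint_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            n_layers)
        # Hypothetical scoring head: one importance score per shot.
        self.scorer = nn.Linear(d_model, 1)

    def forward(self, videos):
        # videos: list of tensors, one per similar video,
        # each of shape (num_shots, frames_per_shot, d_model).
        shot_embs = []
        for v in videos:
            enc = self.shot_encoder(v)          # (shots, frames, d_model)
            shot_embs.append(enc.mean(dim=1))   # pool frames -> (shots, d_model)
        # Concatenate the shots of all similar videos into one sequence so
        # self-attention can model cross-video dependencies directly.
        joint = torch.cat(shot_embs, dim=0).unsqueeze(0)  # (1, total_shots, d)
        joint = self.joint_encoder(joint).squeeze(0)      # (total_shots, d)
        return self.scorer(joint).squeeze(-1)             # (total_shots,)

# Usage: two "similar" videos with 4 and 3 shots of 16 frames each.
model = HierarchicalCoSummarizer()
vids = [torch.randn(4, 16, 512), torch.randn(3, 16, 512)]
scores = model(vids)  # shape (7,): one importance score per shot
```

Concatenating the shots of all similar videos into a single sequence lets the second layer's self-attention model cross-video dependencies directly, mirroring the shot-level joint modelling described above; the reconstruction objective mentioned in the abstract would act as an additional loss term on top of such embeddings.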