Although video summarization has achieved tremendous success with Recurrent Neural Networks (RNNs), RNN-based methods neglect the global dependencies and multi-hop relationships among video frames, which limits their performance. The transformer is an effective model for this problem and surpasses RNN-based methods in several sequence modeling tasks, such as machine translation, video captioning, \emph{etc}. Motivated by the great success of the transformer and the natural hierarchical structure of video (frame-shot-video), we develop a hierarchical transformer for video summarization, which captures the dependencies among frames and shots and summarizes the video by exploiting the scene information formed by shots. Furthermore, we argue that both audio and visual information are essential for the video summarization task. To integrate these two modalities, they are encoded in a two-stream scheme, and a multimodal fusion mechanism is developed on top of the hierarchical transformer. In this paper, the proposed method is denoted as the Hierarchical Multimodal Transformer (HMT). Extensive experiments show that HMT surpasses most traditional, RNN-based, and attention-based video summarization methods.