In this work we propose a novel method for supervised, keyshot-based video summarization by applying a conceptually simple and computationally efficient soft self-attention mechanism. Current state-of-the-art methods leverage bi-directional recurrent networks, such as BiLSTM combined with attention, which are complex to implement and computationally demanding compared to fully connected networks. To that end, we propose a simple, self-attention based network for video summarization that performs the entire sequence-to-sequence transformation in a single feed-forward pass, and a single backward pass during training. Our method sets new state-of-the-art results on TvSum and SumMe, two benchmarks commonly used in this domain.
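As a rough illustration, a frame-level soft self-attention scorer of this kind might be sketched in PyTorch as follows. This is a minimal sketch under stated assumptions, not the paper's exact architecture: the module name `SelfAttentionScorer`, the feature dimension `d`, and the regressor head are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttentionScorer(nn.Module):
    """Soft self-attention over frame features, followed by a
    per-frame importance regressor. Illustrative sketch only."""

    def __init__(self, d: int = 1024):  # d: assumed CNN feature size
        super().__init__()
        # Linear projections for queries, keys, and values.
        self.q = nn.Linear(d, d, bias=False)
        self.k = nn.Linear(d, d, bias=False)
        self.v = nn.Linear(d, d, bias=False)
        # Regressor mapping attended features to a score in [0, 1].
        self.out = nn.Sequential(
            nn.Linear(d, d),
            nn.ReLU(),
            nn.LayerNorm(d),
            nn.Linear(d, 1),
            nn.Sigmoid(),
        )
        self.scale = d ** 0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, d) -- one feature vector per video frame.
        q, k, v = self.q(x), self.k(x), self.v(x)
        # Soft attention weights over all frame pairs: (T, T).
        attn = F.softmax(q @ k.t() / self.scale, dim=-1)
        # Attended representation, one vector per frame: (T, d).
        ctx = attn @ v
        # Per-frame importance scores: (T,).
        return self.out(ctx).squeeze(-1)


# Usage: score 300 frames of 1024-d features in one forward pass.
feats = torch.randn(300, 1024)
scores = SelfAttentionScorer()(feats)
print(scores.shape)  # torch.Size([300])
```

Unlike a BiLSTM, which processes frames sequentially, this attends over the whole sequence in a single matrix multiplication, which is what makes the single feed-forward pass possible.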