Video summarization methods are usually classified as shot-level or frame-level, and the two kinds are generally used in isolation. This paper investigates the underlying complementarity between frame-level and shot-level methods and proposes a stacking ensemble approach for supervised video summarization. First, we build a stacking model that simultaneously predicts key-frame probabilities and temporal interest segments. The two predictions are then combined via soft decision fusion to obtain the final score of each frame in the video. A joint loss function is proposed to train the model. Ablation experiments show that the proposed method outperforms both of the corresponding individual methods. Furthermore, extensive experiments and analysis on two benchmark datasets demonstrate the effectiveness of our method and its superior performance compared with state-of-the-art methods.
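The abstract does not give the fusion rule, so the following is only a minimal sketch of what soft decision fusion of the two outputs could look like. It assumes a convex combination of the per-frame probabilities and the segment-level scores; the function name `soft_fusion`, the `(start, end, score)` segment format, and the weight `alpha` are hypothetical and not taken from the paper.

```python
import numpy as np

def soft_fusion(frame_probs, segments, alpha=0.5):
    """Fuse frame-level probabilities with shot-level segment scores.

    frame_probs: sequence of T key-frame probabilities in [0, 1].
    segments:    iterable of (start, end, score) with end exclusive
                 (hypothetical format for temporal interest segments).
    alpha:       hypothetical mixing weight for the frame-level branch.
    """
    frame_probs = np.asarray(frame_probs, dtype=float)
    seg_scores = np.zeros_like(frame_probs)
    for start, end, score in segments:
        # Where segments overlap, keep the highest score covering a frame.
        seg_scores[start:end] = np.maximum(seg_scores[start:end], score)
    # Assumed fusion rule: weighted average of the two score tracks.
    return alpha * frame_probs + (1.0 - alpha) * seg_scores

frame_probs = [0.1, 0.8, 0.6, 0.2, 0.9]
segments = [(1, 3, 0.7), (4, 5, 0.5)]
fused = soft_fusion(frame_probs, segments)
```

Frames outside any predicted segment receive a segment score of zero here, so their fused score is simply scaled down by `alpha`; other choices (for example, a learned weight) would fit the same description.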