Current methods for video analysis often extract frame-level features using pre-trained convolutional neural networks (CNNs). Such features are then aggregated over time, e.g., by simple temporal averaging or by more sophisticated recurrent neural networks such as long short-term memory (LSTM) or gated recurrent units (GRU). In this work we revise existing video representations and study alternative methods for temporal aggregation. We first explore clustering-based aggregation layers and propose a two-stream architecture aggregating audio and visual features. We then introduce a learnable non-linear unit, named Context Gating, aiming to model interdependencies among network activations. Our experimental results show the advantage of both improvements for the task of video classification. In particular, we evaluate our method on the large-scale multi-modal YouTube-8M v2 dataset and outperform all other methods in the YouTube-8M Large-Scale Video Understanding challenge.
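As a concrete illustration, the sketch below shows one way a Context Gating unit of the kind described above could be implemented, computing sigmoid(Wx + b) applied element-wise as a gate on an aggregated feature vector. The PyTorch framing, class name, and feature dimension are illustrative assumptions, not the authors' reference code.

```python
import torch
import torch.nn as nn


class ContextGating(nn.Module):
    """Minimal sketch of a Context Gating unit: y = sigmoid(W x + b) * x.

    A sigmoid-activated linear projection produces per-dimension gates that
    re-weight the input activations, modeling interdependencies among them.
    The module and dimension here are illustrative, not the paper's code.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, dim)  # learnable W and b

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Element-wise gating of the input features.
        return torch.sigmoid(self.gate(x)) * x


if __name__ == "__main__":
    # Example: gate a batch of 1024-d aggregated video descriptors
    # (the batch size and dimensionality are arbitrary for this demo).
    cg = ContextGating(dim=1024)
    features = torch.randn(8, 1024)
    gated = cg(features)
    print(gated.shape)  # torch.Size([8, 1024])
```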