Recently, substantial research effort has focused on how to apply CNNs or RNNs to better extract temporal patterns from videos, so as to improve the accuracy of video classification. In this paper, however, we show that temporal information, especially longer-term patterns, may not be necessary to achieve competitive results on common video classification datasets. We investigate the potential of purely attention-based local feature integration. Accounting for the characteristics of such features in video classification, we propose a local feature integration framework based on attention clusters, and introduce a shifting operation to capture more diverse signals. We carefully analyze and compare the effect of different attention mechanisms, cluster sizes, and the use of the shifting operation, and also investigate the combination of attention clusters for multimodal integration. We demonstrate the effectiveness of our framework on three real-world video classification datasets. Our model achieves competitive results across all of them. In particular, on the large-scale Kinetics dataset, our framework obtains an excellent single-model accuracy of 79.4% top-1 and 94.0% top-5 on the validation set. The attention clusters are the backbone of our winning solution at the ActivityNet Kinetics Challenge 2017. Code and models will be released soon.
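To make the idea concrete, here is a minimal sketch of an attention cluster with a shifting operation. The parameterization is an assumption for illustration (simple dot-product attention scores, scalar `alpha`/`beta` per unit, L2 normalization after the shift); the paper's exact attention form and learnable parameters may differ.

```python
import numpy as np

def attention_unit(X, w, alpha=1.0, beta=0.0):
    """One attention unit: weighted sum of local features, then shift + normalize.

    X: (n, d) array of n local features; w: (d,) attention parameters.
    alpha, beta implement the shifting operation (scale and shift), which lets
    different units in a cluster attend to diverse signals. Hypothetical
    parameterization, not the paper's exact one.
    """
    scores = X @ w
    a = np.exp(scores - scores.max())
    a /= a.sum()                            # softmax attention weights over local features
    v = a @ X                               # attention-weighted sum, shape (d,)
    v = alpha * v + beta                    # shifting operation
    return v / (np.linalg.norm(v) + 1e-8)   # L2-normalize the unit's output

def attention_cluster(X, W, alphas, betas):
    """Concatenate the outputs of k independent attention units into one vector."""
    return np.concatenate([
        attention_unit(X, W[k], alphas[k], betas[k]) for k in range(len(W))
    ])
```

With k units over d-dimensional features, the cluster output has dimension k*d; clusters for different modalities (e.g. RGB, flow, audio) can then be concatenated for multimodal integration.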