Video understanding has attracted considerable research attention, especially since large-scale video benchmarks became available. In this paper, we address the problem of multi-label video classification. We first observe that there exists a significant knowledge gap between how machines and humans learn: while current machine learning approaches, including deep neural networks, focus largely on representations of the given data, humans often look beyond the data at hand and leverage external knowledge to make better decisions. Towards narrowing this gap, we propose to incorporate external knowledge graphs into video classification. In particular, we unify traditional "knowledgeless" machine learning models and knowledge graphs in a novel end-to-end framework. The framework is flexible enough to work with most existing video classification algorithms, including state-of-the-art deep models. Finally, we conduct extensive experiments on YouTube-8M, the largest public video dataset. The results are promising across the board, improving mean average precision by up to 2.9%.