We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway, operating at a low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at a high frame rate, to capture motion at fine temporal resolution. The Fast pathway can be made very lightweight by reducing its channel capacity, yet it can still learn useful temporal information for video recognition. Our models achieve strong performance for both action classification and detection in video, and large improvements are pinpointed as contributions of our SlowFast concept. We report 79.0% accuracy on the Kinetics dataset without using any pre-training, substantially surpassing the previous best results of this kind. On AVA action detection we achieve a new state of the art of 28.3 mAP. Code will be made publicly available.
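The two-pathway design described in the abstract can be illustrated with a minimal sketch of how one clip might be sampled for each pathway. This is not the authors' implementation: the function name `slowfast_specs`, the parameters `tau` (Slow temporal stride), `alpha` (frame-rate ratio between pathways), `beta` (Fast channel ratio), and all default values are assumptions chosen for illustration, since the abstract states none of them.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PathwaySpec:
    """Which frames a pathway sees and how many channels it uses."""
    name: str
    frame_indices: List[int]
    channels: int


def slowfast_specs(
    num_frames: int,
    tau: int = 16,            # assumed temporal stride of the Slow pathway
    alpha: int = 8,           # assumed frame-rate ratio Fast : Slow
    beta: float = 0.125,      # assumed channel ratio Fast : Slow
    slow_channels: int = 64,  # assumed base width of the Slow pathway
) -> Tuple[PathwaySpec, PathwaySpec]:
    """Return (slow, fast) sampling specs for a clip of num_frames frames."""
    if tau % alpha != 0:
        raise ValueError("tau must be divisible by alpha")

    # The Fast pathway samples alpha times more densely than the Slow one.
    fast_stride = tau // alpha

    slow = PathwaySpec(
        name="slow",
        frame_indices=list(range(0, num_frames, tau)),
        channels=slow_channels,
    )
    fast = PathwaySpec(
        name="fast",
        frame_indices=list(range(0, num_frames, fast_stride)),
        # Reduced channel capacity keeps the Fast pathway lightweight.
        channels=max(1, int(slow_channels * beta)),
    )
    return slow, fast


if __name__ == "__main__":
    slow, fast = slowfast_specs(num_frames=64)
    print(slow.name, len(slow.frame_indices), "frames,", slow.channels, "channels")
    print(fast.name, len(fast.frame_indices), "frames,", fast.channels, "channels")
```

Under these assumed defaults, a 64-frame clip gives the Slow pathway 4 frames at 64 channels and the Fast pathway 32 frames at 8 channels, which shows how the Fast pathway can see many more frames while staying narrow.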