We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway, operating at a low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at a high frame rate, to capture motion at fine temporal resolution. The Fast pathway can be made very lightweight by reducing its channel capacity, yet it can still learn useful temporal information for video recognition. Our models achieve strong performance for both action classification and detection in video, and the large improvements are pinpointed as contributions of our SlowFast concept. We report state-of-the-art accuracy on major video recognition benchmarks: Kinetics, Charades, and AVA. Code will be made publicly available in PyTorch.
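The two-pathway idea described in the abstract can be illustrated with a minimal sketch of how each pathway might sample frames from the same clip, and how the Fast pathway's channel width might be reduced. This is not the authors' implementation. The function names and the parameters `slow_stride`, `speed_ratio`, and `channel_ratio`, along with the numeric values used, are assumptions chosen for illustration and do not come from the text above.

```python
# Illustrative sketch only: names and values here are hypothetical.

def sample_pathway_indices(num_frames, slow_stride, speed_ratio):
    """Return (slow_indices, fast_indices) for a clip of `num_frames` raw frames.

    The Slow pathway keeps one frame every `slow_stride` frames (low frame rate).
    The Fast pathway samples `speed_ratio` times more densely (high frame rate).
    """
    if slow_stride % speed_ratio != 0:
        raise ValueError("slow_stride must be divisible by speed_ratio")
    fast_stride = slow_stride // speed_ratio
    slow_indices = list(range(0, num_frames, slow_stride))
    fast_indices = list(range(0, num_frames, fast_stride))
    return slow_indices, fast_indices


def fast_channels(slow_channels, channel_ratio):
    """Channel width of the Fast pathway as a fraction of the Slow pathway's width."""
    return max(1, int(slow_channels * channel_ratio))


# Assumed example: a 64-frame clip, Slow stride of 16, Fast pathway 8x denser.
slow_idx, fast_idx = sample_pathway_indices(num_frames=64, slow_stride=16, speed_ratio=8)
print(len(slow_idx), len(fast_idx))

# Assumed example: the Fast pathway uses 1/8 of the Slow pathway's channels.
print(fast_channels(slow_channels=64, channel_ratio=0.125))
```

Under these assumed settings the Fast pathway sees many more frames than the Slow pathway but carries far fewer channels per layer, which is the sense in which a high-frame-rate pathway can remain lightweight.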