We introduce Spatial-Temporal Memory Networks for video object detection. At its core, a novel Spatial-Temporal Memory module (STMM) serves as the recurrent computation unit to model long-term temporal appearance and motion dynamics. The STMM's design enables full integration of pretrained backbone CNN weights, which we find to be critical for accurate detection. Furthermore, in order to tackle object motion in videos, we propose a novel MatchTrans module to align the spatial-temporal memory from frame to frame. Our method produces state-of-the-art results on the benchmark ImageNet VID dataset, and our ablation studies clearly demonstrate the contribution of our different design choices. We release our code and models at http://fanyix.cs.ucdavis.edu/project/stmn/project.html.
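The two components named above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: all shapes, weight names, and the 1×1 channel-mixing maps are illustrative assumptions (the paper's STMM uses convolutional gates with ReLU nonlinearities, and MatchTrans soft-matches features within a small spatial neighborhood to warp the memory between frames).

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def stmm_step(feat, mem, Wz, Uz, Wr, Ur, W, U):
    """One gated recurrent update of the spatial-temporal memory (sketch).

    feat: (C, H, W) backbone CNN features at frame t
    mem:  (C, H, W) spatial-temporal memory from frame t-1
    The gates here are 1x1 channel-mixing matrices with clipped-ReLU
    activations; the paper uses convolutional gates instead.
    """
    def mix(M, x):
        # (C, C) x (C, H, W) -> (C, H, W), i.e. a 1x1 "convolution"
        return np.tensordot(M, x, axes=([1], [0]))

    z = np.minimum(relu(mix(Wz, feat) + mix(Uz, mem)), 1.0)  # update gate
    r = np.minimum(relu(mix(Wr, feat) + mix(Ur, mem)), 1.0)  # reset gate
    cand = relu(mix(W, feat) + mix(U, r * mem))              # candidate memory
    return (1.0 - z) * mem + z * cand

def match_trans(mem, feat_prev, feat_cur, k=1):
    """Align memory from frame t-1 to frame t (sketch of MatchTrans).

    Each output location is a softmax-weighted average of memory cells in a
    (2k+1)x(2k+1) neighborhood, weighted by feature correlation between the
    current and previous frame.
    """
    C, H, W = feat_cur.shape
    out = np.zeros_like(mem)
    for y in range(H):
        for x in range(W):
            ys = range(max(0, y - k), min(H, y + k + 1))
            xs = range(max(0, x - k), min(W, x + k + 1))
            # correlation scores against the previous frame's features
            scores = np.array([[feat_cur[:, y, x] @ feat_prev[:, j, i]
                                for i in xs] for j in ys])
            w = np.exp(scores - scores.max())
            w /= w.sum()
            for jj, j in enumerate(ys):
                for ii, i in enumerate(xs):
                    out[:, y, x] += w[jj, ii] * mem[:, j, i]
    return out

# Tiny demo with toy shapes (purely illustrative).
C, H, W = 4, 5, 6
feat = relu(rng.standard_normal((C, H, W)))
mem0 = relu(rng.standard_normal((C, H, W)))
mats = [0.1 * rng.standard_normal((C, C)) for _ in range(6)]
mem1 = stmm_step(feat, mem0, *mats)
aligned = match_trans(mem1, feat, feat, k=1)
```

In the actual model, the aligned memory (rather than a single frame's features) feeds the detection head, so each frame's predictions draw on appearance evidence aggregated over time.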