We introduce Spatial-Temporal Memory Networks (STMN) for video object detection. At its core, we propose a novel Spatial-Temporal Memory module (STMM) as the recurrent computation unit to model long-term temporal appearance and motion dynamics. The STMM's design enables the integration of ImageNet pre-trained backbone CNN weights for both the feature stack and the prediction head, which we find to be critical for accurate detection. Furthermore, to handle object motion in videos, we propose a novel MatchTrans module that aligns the spatial-temporal memory from frame to frame. We compare our method to state-of-the-art detectors on ImageNet VID and conduct ablation studies to dissect the contribution of our different design choices. We obtain state-of-the-art results with the VGG backbone and competitive results with the ResNet backbone. To our knowledge, this is the first video object detector equipped with an explicit memory mechanism to model long-term temporal dynamics.
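Since the abstract describes the two modules only at a high level, the PyTorch sketch below illustrates one plausible reading of them: a ConvGRU-style STMM update whose candidate memory uses ReLU (so that ReLU-based ImageNet pre-trained weights remain compatible), and a MatchTrans-style warp that aligns the previous memory by local feature affinities before the recurrent update. All names (STMMCell, match_trans_align), the gating form, the window radius k, and the wrap-around border handling are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class STMMCell(nn.Module):
    """ConvGRU-style spatial-temporal memory cell (hypothetical sketch).

    Gates and candidate memory are computed with convolutions so the memory
    keeps its spatial layout; ReLU replaces tanh so the candidate memory
    stays compatible with ReLU-based ImageNet pre-trained backbones.
    """

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # Update gate z and reset gate r, both computed from [features, memory].
        self.gate_conv = nn.Conv2d(2 * channels, 2 * channels, kernel_size, padding=pad)
        # Candidate memory computed from [features, reset-gated memory].
        self.cand_conv = nn.Conv2d(2 * channels, channels, kernel_size, padding=pad)

    def forward(self, feat: torch.Tensor, mem: torch.Tensor) -> torch.Tensor:
        z, r = torch.sigmoid(self.gate_conv(torch.cat([feat, mem], dim=1))).chunk(2, dim=1)
        cand = F.relu(self.cand_conv(torch.cat([feat, r * mem], dim=1)))
        # Convex combination of old memory and candidate, per spatial location.
        return (1.0 - z) * mem + z * cand


def match_trans_align(mem_prev, feat_prev, feat_cur, k: int = 2):
    """MatchTrans-style alignment (hypothetical): warp the previous memory with
    softmax feature affinities over a (2k+1) x (2k+1) local window.

    torch.roll wraps around at the borders -- a simplification for brevity.
    """
    logits, shifted_mems = [], []
    for dy in range(-k, k + 1):
        for dx in range(-k, k + 1):
            f = torch.roll(feat_prev, shifts=(dy, dx), dims=(2, 3))
            m = torch.roll(mem_prev, shifts=(dy, dx), dims=(2, 3))
            # Affinity between the current feature and the shifted previous feature.
            logits.append((feat_cur * f).sum(dim=1, keepdim=True))
            shifted_mems.append(m)
    affinity = torch.softmax(torch.cat(logits, dim=1), dim=1)   # (B, N, H, W)
    mem_stack = torch.stack(shifted_mems, dim=1)                # (B, N, C, H, W)
    return (affinity.unsqueeze(2) * mem_stack).sum(dim=1)       # (B, C, H, W)


if __name__ == "__main__":
    cell = STMMCell(channels=64)
    f_prev, f_cur = torch.randn(1, 64, 14, 14), torch.randn(1, 64, 14, 14)
    mem = cell(f_prev, torch.zeros(1, 64, 14, 14))             # memory at t-1
    mem = cell(f_cur, match_trans_align(mem, f_prev, f_cur))   # aligned update at t
```

The point the sketch tries to capture is the ordering stated in the abstract: the memory is aligned to the current frame (MatchTrans) before the recurrent update (STMM), so the spatial memory can follow objects as they move across frames.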