Temporal Action Detection (TAD) is a crucial but challenging task in video understanding. It aims to detect both the category and the start and end frames of each action instance in a long, untrimmed video. Most current models use both RGB and optical-flow streams for the TAD task. The original RGB frames must therefore be converted manually into optical-flow frames at additional computation and time cost, which is an obstacle to real-time processing. Many current models also adopt two-stage strategies, which slow down inference and complicate the tuning of proposal generation. By comparison, we propose a one-stage, anchor-free temporal localization method that uses the RGB stream only, in which a novel Newtonian \emph{Mechanics-MLP} architecture is established. It achieves accuracy comparable to existing state-of-the-art models while surpassing their inference speed by a large margin. The typical inference speed reported in this paper is 4.44 videos per second on THUMOS14. In applications, because optical flow does not need to be computed, inference will be faster still. The results also show that \emph{MLP} has great potential in downstream tasks such as TAD. The source code is available at \url{https://github.com/BonedDeng/TadML}
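To make the term "one-stage anchor-free temporal localization" concrete, the sketch below shows one common way such predictions can be decoded into action segments: every time step directly predicts class scores and its distances to the action's start and end, so no separate proposal stage is needed. This is a generic illustration, not the architecture of this paper; the function name `decode_anchor_free`, its parameters, and the input layout are all assumptions introduced for the example.

```python
from typing import List, Sequence, Tuple

# (start_seconds, end_seconds, class_id, score)
Detection = Tuple[float, float, int, float]


def decode_anchor_free(
    cls_scores: Sequence[Sequence[float]],
    offsets: Sequence[Tuple[float, float]],
    stride_s: float,
    score_thresh: float = 0.5,
) -> List[Detection]:
    """Decode per-time-step predictions into action segments.

    Hypothetical input layout (an assumption for this sketch):
      cls_scores[t][c] -- confidence that time step t lies inside class c
      offsets[t]       -- (d_start, d_end): predicted distances, in time
                          steps, from t back to the start and forward to the end
      stride_s         -- duration of one time step in seconds
    """
    detections: List[Detection] = []
    for t, (scores, (d_start, d_end)) in enumerate(zip(cls_scores, offsets)):
        # Pick the highest-scoring class at this time step.
        class_id = max(range(len(scores)), key=scores.__getitem__)
        score = scores[class_id]
        if score < score_thresh:
            continue  # low-confidence time steps produce no segment
        # Each confident time step regresses its own segment directly,
        # without reference to predefined anchor windows.
        start = max(0.0, (t - d_start) * stride_s)
        end = (t + d_end) * stride_s
        detections.append((start, end, class_id, score))
    return detections
```

In a complete detector, overlapping segments produced this way would typically be merged afterwards, for example with non-maximum suppression; that step is omitted here.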