Spatial-temporal action detection is a vital part of video understanding. Current methods first use an object detector to obtain person candidate proposals and then classify those candidates into action categories. Such two-stage methods are computationally heavy and hard to deploy in real-world applications. Some existing methods adopt a unified model structure, but they perform poorly in their vanilla form and often require extra modules to boost performance. In this paper, we explore how to build an end-to-end spatial-temporal action detector with minimal modifications. To this end, we propose a new method named ME-STAD, which solves the spatial-temporal action detection problem in an end-to-end manner. Besides the model design, we propose a novel labeling strategy to deal with sparse annotations in spatial-temporal datasets. The proposed ME-STAD achieves better results (a 2.2% mAP gain) than the original two-stage detectors with around 80% fewer FLOPs. Moreover, ME-STAD requires only minimal modifications to previous methods and no extra components. Our code will be made public.
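To make the two-stage versus end-to-end contrast concrete, the minimal sketch below outlines the control flow of both inference styles in plain Python. The `person_detector`, `action_classifier`, and `end_to_end_inference` callables are hypothetical placeholders used only for illustration, not the paper's actual components or ME-STAD's implementation.

```python
# Minimal sketch contrasting two-stage and end-to-end inference.
# All components here are hypothetical stand-ins, not the paper's modules.
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)
Detection = Tuple[Box, str, float]        # (box, action label, score)


def person_detector(clip: List[str]) -> List[Box]:
    """Stage 1 of a two-stage pipeline: propose person boxes (dummy output)."""
    return [(10.0, 20.0, 110.0, 220.0)]


def action_classifier(clip: List[str], box: Box) -> Tuple[str, float]:
    """Stage 2: classify the action inside one person box (dummy output)."""
    return ("walking", 0.9)


def two_stage_inference(clip: List[str]) -> List[Detection]:
    # Detector and classifier run as separate models, which is what makes
    # two-stage pipelines heavy at inference time.
    boxes = person_detector(clip)
    return [(box, *action_classifier(clip, box)) for box in boxes]


def end_to_end_inference(clip: List[str]) -> List[Detection]:
    # A single model predicts person boxes and action scores jointly;
    # this is the direction ME-STAD pursues (details are in the paper).
    return [((10.0, 20.0, 110.0, 220.0), "walking", 0.9)]


if __name__ == "__main__":
    clip = ["frame_%03d.jpg" % i for i in range(8)]
    print(two_stage_inference(clip))
    print(end_to_end_inference(clip))
```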