Designing a real-time framework for spatio-temporal action detection remains a challenge. In this paper, we propose YOWOv2, a novel real-time action detection framework. YOWOv2 takes advantage of both a 3D backbone and a 2D backbone for accurate action detection. A multi-level detection pipeline is designed to detect action instances of different scales. To this end, we carefully build a simple and efficient 2D backbone with a feature pyramid network to extract classification features and regression features at different levels. For the 3D backbone, we adopt an existing efficient 3D CNN to save development time. By combining 3D backbones and 2D backbones of different sizes, we design a YOWOv2 family including YOWOv2-Tiny, YOWOv2-Medium, and YOWOv2-Large. We also introduce the popular dynamic label assignment strategy and anchor-free mechanism to make YOWOv2 consistent with advanced model architecture design. With these improvements, YOWOv2 is significantly superior to YOWO while still running in real time. Without any bells and whistles, YOWOv2 achieves 87.0% frame mAP and 52.8% video mAP at over 20 FPS on UCF101-24. On AVA, YOWOv2 achieves 21.7% frame mAP at over 20 FPS. Our code is available at https://github.com/yjh0410/YOWOv2.
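As a rough illustration of the multi-level pipeline, the shape-level sketch below pairs each 2D pyramid level with a single 3D clip feature. It is not taken from the released code: the strides (8, 16, 32), the channel widths, the 224-pixel input, and the upsample-then-concatenate fusion are all assumptions made for illustration, and `FeatureMap`, `pyramid_2d`, and `fuse_level` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureMap:
    """Shape of one feature map (channels, height, width); no tensor data."""
    channels: int
    height: int
    width: int


def pyramid_2d(image_size, strides, channels):
    """Shapes of the per-level 2D feature maps for a square key frame."""
    return [FeatureMap(channels, image_size // s, image_size // s) for s in strides]


def fuse_level(feat_2d, feat_3d):
    """Shape after upsampling the 3D clip feature to the 2D level's
    resolution and concatenating the two along the channel axis."""
    if feat_2d.height % feat_3d.height or feat_2d.width % feat_3d.width:
        raise ValueError("2D level must be an integer multiple of the 3D feature size")
    return FeatureMap(feat_2d.channels + feat_3d.channels, feat_2d.height, feat_2d.width)


# Assumed configuration (hypothetical values, not from the paper).
IMAGE_SIZE = 224
STRIDES = (8, 16, 32)

levels = pyramid_2d(IMAGE_SIZE, STRIDES, channels=256)
# Assumed: the 3D backbone collapses the time axis and outputs one stride-32 map.
clip_feature = FeatureMap(channels=512, height=IMAGE_SIZE // 32, width=IMAGE_SIZE // 32)

fused = [fuse_level(level, clip_feature) for level in levels]
for stride, feat in zip(STRIDES, fused):
    print(f"stride {stride}: {feat.channels} x {feat.height} x {feat.width}")
```

Under these assumptions, each level keeps its own spatial resolution while sharing the same clip-level motion feature, which is one plausible way for a detector to handle action instances of different scales.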