Video action detection (spatio-temporal action localization) is usually the starting point for human-centric intelligent analysis of videos nowadays. It has high practical impact on many applications across robotics, security, healthcare, etc. The two-stage paradigm of Faster R-CNN in object detection inspires a standard paradigm of video action detection, i.e., first generating person proposals and then classifying their actions. However, none of the existing solutions can provide fine-grained action detection at the "who-when-where-what" level. This paper presents a tracking-based solution that accurately and efficiently localizes predefined key actions both spatially (by predicting the associated target IDs and locations) and temporally (by predicting the time in exact frame indices). This solution won first place in the UAV-Video Track of the 2021 Low-Power Computer Vision Challenge (LPCVC).
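For illustration only, the sketch below shows one way the "who-when-where-what" output described above could be represented as a per-detection record; the class name, field names, and values are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class ActionDetection:
    """Hypothetical record for one detected key action ("who-when-where-what")."""
    target_id: int                            # "who": identity of the tracked person
    start_frame: int                          # "when": temporal extent in exact frame indices
    end_frame: int
    box: Tuple[float, float, float, float]    # "where": (x1, y1, x2, y2) box of the person
    action: str                               # "what": predefined key action label

# Example usage with made-up values
det = ActionDetection(target_id=3, start_frame=120, end_frame=168,
                      box=(410.0, 220.0, 480.0, 360.0), action="pick_up")
print(det)
```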