The success of deep learning on video Action Recognition (AR) has motivated researchers to push related tasks progressively from the coarse level to the fine-grained level. Whereas conventional AR only predicts a single action label for an entire video, Temporal Action Detection (TAD) estimates the start and end time of each action in a video. Taking TAD a step further, Spatiotemporal Action Detection (SAD) localizes each action both spatially and temporally. However, SAD generally ignores who performs the action, even though identifying the actor can also be important. To this end, we propose a novel task, Actor-identified Spatiotemporal Action Detection (ASAD), to bridge the gap between SAD and actor identification. In ASAD, we not only detect the spatiotemporal boundary of each instance-level action but also assign a unique ID to each actor. Two elements are fundamental to ASAD: Multiple Object Tracking (MOT) and Action Classification (AC). MOT provides the spatiotemporal boundary of each actor and associates it with a unique actor identity; AC then estimates the action class within that boundary. Because ASAD is a new task, it poses challenges that existing methods cannot address: i) no dataset has been created specifically for ASAD, ii) no evaluation metrics have been designed for ASAD, and iii) current MOT performance is the bottleneck to obtaining satisfactory ASAD results. To address these problems, we i) annotate a new ASAD dataset, ii) propose ASAD evaluation metrics that account for multi-label actions and actor identification, and iii) improve the data association strategies in MOT to boost MOT performance, which leads to better ASAD results. The code is available at https://github.com/fandulu/ASAD.
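The two-stage structure described above (MOT first, then AC within each track) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the greedy IoU matcher is a simple stand-in for a real tracker and is not the data association strategy proposed in this work, and the names `Track`, `associate`, `asad`, and the `classify` callback are hypothetical rather than taken from the released code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Track:
    actor_id: int                                        # unique actor identity
    boxes: Dict[int, Box] = field(default_factory=dict)  # frame index -> box


def associate(frames: List[List[Box]], iou_thresh: float = 0.3) -> List[Track]:
    """Greedy frame-to-frame IoU association (illustrative stand-in for MOT)."""
    tracks: List[Track] = []
    for t, detections in enumerate(frames):
        unmatched = list(detections)
        for track in tracks:
            last = track.boxes.get(t - 1)  # only extend tracks seen in the previous frame
            if last is None or not unmatched:
                continue
            best = max(unmatched, key=lambda d: iou(last, d))
            if iou(last, best) >= iou_thresh:
                track.boxes[t] = best
                unmatched.remove(best)
        # Every detection left unmatched starts a new actor identity.
        for det in unmatched:
            tracks.append(Track(actor_id=len(tracks), boxes={t: det}))
    return tracks


def asad(
    frames: List[List[Box]],
    classify: Callable[[Track], Set[str]],  # hypothetical multi-label action classifier
) -> List[Tuple[int, int, int, Set[str]]]:
    """Return (actor_id, start_frame, end_frame, action_labels) for each track."""
    return [
        (tr.actor_id, min(tr.boxes), max(tr.boxes), classify(tr))
        for tr in associate(frames)
    ]
```

Each output tuple pairs an actor identity with its temporal extent and a set of action labels, which reflects the multi-label setting mentioned above; the per-frame boxes stored in each `Track` carry the spatial part of the boundary.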