Current methods for spatiotemporal action tube detection often extend a bounding box proposal at a given keyframe into a 3D temporal cuboid and pool features from nearby frames. However, such pooling fails to accumulate meaningful spatiotemporal features if the position or shape of the actor varies substantially across frames, whether because of large camera motion, large actor shape deformation, or fast actor movement. In this work, we study the performance of cuboid-aware feature aggregation in action detection under large motion. Further, we propose to enhance actor feature representation under large motion by tracking actors and performing temporal feature aggregation along the respective tracks. We define actor motion by the intersection-over-union (IoU) between the boxes of action tubes/tracks at various fixed time scales. An action with large motion yields lower IoU over time, whereas slower actions maintain higher IoU. We find that track-aware feature aggregation consistently achieves a large improvement in action detection performance over the cuboid-aware baseline, especially for actions under large motion. As a result, we also report state-of-the-art results on the large-scale MultiSports dataset. The code is available at https://github.com/gurkirt/ActionTrackDetectron.
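The IoU-based notion of actor motion can be illustrated with a minimal sketch. It assumes a track is a list of per-frame boxes in `(x1, y1, x2, y2)` format and that the IoU values at a given frame offset are averaged over the track. The abstract does not specify the aggregation or the time scales, so the averaging and the names `box_iou` and `track_motion_iou` are illustrative assumptions rather than the authors' implementation.

```python
from typing import List, Sequence

Box = Sequence[float]  # assumed format: (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def track_motion_iou(track: List[Box], step: int) -> float:
    """Mean IoU between boxes `step` frames apart along one track.

    Lower values indicate larger actor motion at that time scale.
    Averaging over the track is an assumption made for this sketch.
    """
    if step <= 0 or step >= len(track):
        raise ValueError("step must be in [1, len(track) - 1]")
    ious = [box_iou(track[t], track[t + step])
            for t in range(len(track) - step)]
    return sum(ious) / len(ious)


# Hypothetical tracks: a stationary actor and one moving 5 px per frame.
static_track = [(0, 0, 10, 10)] * 4
moving_track = [(5 * t, 0, 5 * t + 10, 10) for t in range(4)]
```

On these toy tracks, the stationary actor keeps a high IoU at every offset, while the moving actor's IoU falls as the offset grows, which matches the intuition that larger motion yields lower IoU over time.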