Current methods for spatiotemporal action tube detection often extend a bounding box proposal at a given keyframe into a 3D temporal cuboid and pool features from nearby frames. However, such pooling fails to accumulate meaningful spatiotemporal features if the position or shape of the actor varies substantially across frames, for example because of large camera motion, large actor shape deformation, or fast actor movement. In this work, we study the performance of cuboid-aware feature aggregation in action detection under large motion. Further, we propose to enhance actor feature representation under large motion by tracking actors and performing temporal feature aggregation along the respective tracks. We define actor motion by the intersection-over-union (IoU) between the boxes of action tubes/tracks at various fixed time scales. An action with large motion results in lower IoU over time, whereas slower actions maintain higher IoU. We find that track-aware feature aggregation consistently achieves a large improvement in action detection performance over the cuboid-aware baseline, especially for actions under large motion. As a result, we also report state-of-the-art results on the large-scale MultiSports dataset.
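The IoU-based notion of actor motion can be illustrated with a minimal sketch. The abstract does not specify the time scales or how per-pair IoU values are combined, so the frame offsets, the averaging over all box pairs, and the names `box_iou` and `track_motion_iou` are assumptions made for illustration only, not the authors' implementation.

```python
from typing import Dict, Sequence, Tuple

# A box is (x1, y1, x2, y2) in pixel coordinates (assumed convention).
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def track_motion_iou(
    track: Sequence[Box],
    offsets: Sequence[int] = (1, 4, 8),  # hypothetical time scales, in frames
) -> Dict[int, float]:
    """Mean IoU between boxes of one track that are `dt` frames apart.

    Lower values indicate larger motion at that time scale. Offsets that
    are longer than the track are skipped.
    """
    scores: Dict[int, float] = {}
    for dt in offsets:
        pairs = [(track[t], track[t + dt]) for t in range(len(track) - dt)]
        if pairs:
            scores[dt] = sum(box_iou(p, q) for p, q in pairs) / len(pairs)
    return scores


# A static actor versus one that shifts 5 pixels to the right per frame.
static_track = [(0.0, 0.0, 10.0, 10.0)] * 4
moving_track = [(5.0 * t, 0.0, 5.0 * t + 10.0, 10.0) for t in range(4)]
static_scores = track_motion_iou(static_track, offsets=(1, 2))
moving_scores = track_motion_iou(moving_track, offsets=(1, 2))
```

Under these assumptions the static track keeps an IoU of 1 at every offset, while the moving track's IoU falls as the offset grows, which is the behavior the abstract uses to separate large-motion actions from slower ones.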