Real-time and online action localization in a video is a critical yet highly challenging problem. Accurate action localization requires the utilization of both temporal and spatial information. Recent attempts achieve this by using computationally intensive 3D CNN architectures or highly redundant two-stream architectures with optical flow, making them both unsuitable for real-time, online applications. To accomplish activity localization under highly challenging real-time constraints, we propose utilizing fast and efficient key-point based bounding box prediction to spatially localize actions. We then introduce a tube-linking algorithm that maintains the continuity of action tubes temporally in the presence of occlusions. Further, we eliminate the need for a two-stream architecture by combining temporal and spatial information into a cascaded input to a single network, allowing the network to learn from both types of information. Temporal information is efficiently extracted using a structural similarity index map as opposed to computationally intensive optical flow. Despite the simplicity of our approach, our lightweight end-to-end architecture achieves state-of-the-art frame-mAP of 74.7% on the challenging UCF101-24 dataset, demonstrating a performance gain of 6.4% over the previous best online methods. We also achieve state-of-the-art video-mAP results compared to both online and offline methods. Moreover, our model achieves a frame rate of 41.8 FPS, which is a 10.7% improvement over contemporary real-time methods.
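To make the cascaded-input idea concrete, the following is a minimal sketch (not the authors' released code) of how a per-pixel structural similarity (SSIM) map between consecutive frames can stand in for optical flow as a cheap temporal cue and be stacked with the current RGB frame before being fed to a single detection network. The grayscale conversion, the 4-channel layout, and the function name `cascaded_input` are illustrative assumptions.

```python
import numpy as np
from skimage.metrics import structural_similarity

def cascaded_input(prev_frame: np.ndarray, curr_frame: np.ndarray) -> np.ndarray:
    """Stack the current RGB frame with the SSIM map of (prev, curr).

    prev_frame, curr_frame: uint8 arrays of shape (H, W, 3).
    Returns a float32 array of shape (H, W, 4) with values in [0, 1].
    """
    # Convert to grayscale before computing structural similarity.
    prev_gray = prev_frame.mean(axis=2)
    curr_gray = curr_frame.mean(axis=2)

    # full=True returns the per-pixel SSIM map alongside the mean score;
    # low values indicate motion or appearance change between frames.
    _, ssim_map = structural_similarity(
        prev_gray, curr_gray, data_range=255.0, full=True
    )

    rgb = curr_frame.astype(np.float32) / 255.0
    temporal = ssim_map.astype(np.float32)[..., None]  # (H, W, 1)

    # Cascade spatial (RGB) and temporal (SSIM) channels into one input tensor.
    return np.concatenate([rgb, temporal], axis=2)
```

Compared to optical flow, the SSIM map requires only a single pass over a pair of frames with no iterative estimation, which is what makes this formulation suitable for real-time, online use.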