Interpreting human actions requires understanding the spatial and temporal context of the scene. State-of-the-art action detectors based on Convolutional Neural Networks (CNNs) have demonstrated remarkable results by adopting two-stream or 3D CNN architectures. However, these methods typically operate in a non-real-time, offline fashion due to the complexity involved in reasoning about spatio-temporal information. Consequently, their high computational cost is incompatible with emerging real-world scenarios such as service robots or public surveillance, where detection must take place on resource-limited edge devices. In this paper, we propose ACDnet, a compact action detection network targeting real-time edge computing that addresses both efficiency and accuracy. It intelligently exploits the temporal coherence between successive video frames to approximate their CNN features rather than naively extracting them. It also integrates memory feature aggregation over past frames to enhance the stability of current detections, implicitly modeling long-range temporal cues. Experiments conducted on the public benchmark datasets UCF-24 and JHMDB-21 demonstrate that ACDnet, when integrated with the SSD detector, robustly achieves detection well above real-time (75 FPS) while retaining reasonable accuracy (70.92 and 49.53 frame mAP, respectively) compared to other top-performing methods that use far heavier configurations. Code will be available at https://github.com/dginhac/ACDnet.
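To make the two mechanisms concrete, the sketch below illustrates, in plain PyTorch, how key-frame features might be approximated for a later frame via a cheap motion field, and how a memory feature might be blended with the current one. All names, shapes, and the warping-plus-exponential-averaging design here are our illustrative assumptions, not ACDnet's published implementation.

    # Minimal, self-contained sketch of (1) feature approximation by warping
    # key-frame features with a lightweight motion field, and (2) memory
    # feature aggregation over time. Illustrative only; not ACDnet's code.
    import torch
    import torch.nn.functional as F

    def warp_features(key_feat, flow):
        """Approximate current-frame features by warping key-frame features.

        key_feat: (N, C, H, W) features extracted once at the key frame.
        flow:     (N, 2, H, W) per-pixel (dx, dy) displacement in pixels,
                  assumed to come from a small, cheap motion-estimation net.
        """
        n, _, h, w = key_feat.shape
        # Base sampling grid in normalized [-1, 1] coordinates, (x, y) order,
        # as expected by F.grid_sample.
        ys, xs = torch.meshgrid(
            torch.linspace(-1.0, 1.0, h),
            torch.linspace(-1.0, 1.0, w),
            indexing="ij",
        )
        base = torch.stack((xs, ys), dim=-1).expand(n, h, w, 2)  # (N, H, W, 2)
        # Convert the pixel-space flow to the same normalized range.
        scale = torch.tensor([2.0 / max(w - 1, 1), 2.0 / max(h - 1, 1)])
        grid = base + flow.permute(0, 2, 3, 1) * scale           # (N, H, W, 2)
        return F.grid_sample(key_feat, grid, mode="bilinear", align_corners=True)

    def aggregate_memory(memory, current, alpha=0.8):
        """Exponentially weighted memory that stabilizes the current feature."""
        return alpha * memory + (1.0 - alpha) * current

    # Illustrative use on one frame (random tensors stand in for real data):
    key_feat  = torch.randn(1, 256, 38, 38)    # backbone features at key frame
    flow      = torch.randn(1, 2, 38, 38)      # cheap motion estimate (pixels)
    curr_feat = warp_features(key_feat, flow)  # approximation, no backbone pass
    memory    = aggregate_memory(key_feat, curr_feat, alpha=0.8)

The sketch mirrors the general flow-guided feature approximation idea from the video-recognition literature; the actual ACDnet modules may differ in how motion is estimated and how aggregation weights are computed.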