Understanding human behavior and activity facilitates the advancement of numerous real-world applications and is critical for video analysis. Despite the progress of action recognition algorithms on trimmed videos, the majority of real-world videos are lengthy and untrimmed, with sparse segments of interest. The task of temporal activity detection in untrimmed videos aims to localize the temporal boundaries of actions and classify their categories. The temporal activity detection task has been investigated in full and limited supervision settings, depending on the availability of action annotations. This paper provides an extensive overview of deep learning-based algorithms that tackle temporal action detection in untrimmed videos at different supervision levels, including fully-supervised, weakly-supervised, unsupervised, self-supervised, and semi-supervised. In addition, this paper reviews advances in spatio-temporal action detection, where actions are localized in both temporal and spatial dimensions. Moreover, the commonly used action detection benchmark datasets and evaluation metrics are described, and the performance of state-of-the-art methods is compared. Finally, real-world applications of temporal action detection in untrimmed videos and a set of future directions are discussed.