Action recognition from videos, i.e., classifying a video into one of a set of pre-defined action types, has been a popular topic in the artificial intelligence, multimedia, and signal processing communities. However, existing methods usually treat an input video as a whole and learn models, e.g., Convolutional Neural Networks (CNNs), with coarse video-level class labels. These methods can only output an action class for the whole video and cannot provide fine-grained, explainable cues to answer why the video shows a specific action. Therefore, researchers have started to focus on a new task, Part-level Action Parsing (PAP), which aims not only to predict the video-level action but also to recognize the frame-level fine-grained actions or interactions of body parts for each person in the video. To this end, we propose a coarse-to-fine framework for this challenging task. In particular, our framework first predicts the video-level class of the input video, then localizes the body parts and predicts the part-level action of each part. Moreover, to balance accuracy and computation in part-level action parsing, we propose to recognize part-level actions from segment-level features. Furthermore, to overcome the ambiguity of body parts, we propose a pose-guided positional embedding method that localizes body parts accurately. Through comprehensive experiments on a large-scale dataset, i.e., Kinetics-TPS, our framework achieves state-of-the-art performance, outperforming existing methods with a 31.10% ROC score.
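As a minimal sketch of the coarse-to-fine pipeline described above (assuming a PyTorch-style implementation; all module names, feature shapes, and class counts below are illustrative assumptions, not the authors' released code):

import torch
import torch.nn as nn

class CoarseToFineParser(nn.Module):
    """Hypothetical two-stage parser: coarse video-level classification,
    then fine part-level action recognition on segment-level part features."""
    def __init__(self, num_video_classes, num_part_actions, num_parts=10, feat_dim=256):
        super().__init__()
        # Stage 1: coarse head over the pooled clip feature.
        self.video_head = nn.Linear(feat_dim, num_video_classes)
        # Stage 2: fine head over segment-level features of localized body parts.
        self.part_head = nn.Linear(feat_dim, num_part_actions)
        # Pose-guided positional embedding: a learned vector per body part,
        # added to part features to resolve part ambiguity (e.g., left vs. right arm).
        self.part_pos_embed = nn.Embedding(num_parts, feat_dim)

    def forward(self, clip_feat, part_feats, part_ids):
        # clip_feat:  (B, D)    pooled feature of the whole video
        # part_feats: (B, P, D) segment-level features of localized body parts
        # part_ids:   (B, P)    integer ids of the detected parts
        video_logits = self.video_head(clip_feat)                 # coarse prediction
        part_feats = part_feats + self.part_pos_embed(part_ids)  # pose-guided embedding
        part_logits = self.part_head(part_feats)                 # fine prediction
        return video_logits, part_logits

# Toy usage with random tensors (sizes are illustrative, not Kinetics-TPS specs):
model = CoarseToFineParser(num_video_classes=24, num_part_actions=74)
video_logits, part_logits = model(
    torch.randn(2, 256), torch.randn(2, 10, 256),
    torch.arange(10).expand(2, 10))

The sketch collapses the localization stage into pre-computed part features; in practice the part boxes would come from a detector and the features from a video backbone.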