Fine-grained human action recognition is a core research topic in computer vision. Inspired by the hierarchical representation of fine-grained actions recently proposed in FineGym and by the SlowFast network for action recognition, we propose a novel multi-task network that exploits the FineGym hierarchy to achieve effective joint learning and prediction for fine-grained human action recognition. The network consists of three SlowOnly pathways with gradually increasing frame rates for the events, sets and elements of fine-grained actions, followed by our proposed integration layers for joint learning and prediction. The approach has two stages: it first learns a deep feature representation at each hierarchical level, and then performs feature encoding and fusion for multi-task learning. On the FineGym dataset, our method achieves new state-of-the-art performance, with 91.80% Top-1 accuracy and 88.46% mean accuracy for element actions, which are 3.40% and 7.26% higher than the previous best results.
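As a rough illustration of the three-pathway idea, the sketch below samples one clip at three frame rates, one per hierarchy level, and then joins per-level features. The stride values (16, 8, 4), the function names, and the concatenation-based fusion are assumptions made for illustration only; the abstract does not specify the actual frame rates, and the proposed integration layers are not reproduced here.

```python
from typing import Dict, List

# Hypothetical temporal strides: a smaller stride means a higher frame rate.
# The abstract only states that frame rates increase from events to sets
# to elements; these specific values are assumed.
ASSUMED_STRIDES: Dict[str, int] = {"event": 16, "set": 8, "element": 4}


def sample_frame_indices(num_frames: int, stride: int) -> List[int]:
    """Return every `stride`-th frame index of a clip with `num_frames` frames."""
    if num_frames <= 0 or stride <= 0:
        raise ValueError("num_frames and stride must be positive")
    return list(range(0, num_frames, stride))


def build_pathway_inputs(num_frames: int) -> Dict[str, List[int]]:
    """Pick frame indices for each hierarchy level from one shared clip."""
    return {
        level: sample_frame_indices(num_frames, stride)
        for level, stride in ASSUMED_STRIDES.items()
    }


def fuse_features(features: Dict[str, List[float]]) -> List[float]:
    """Placeholder fusion: concatenate per-level features in a fixed order.

    This stands in for the integration layers, which are not described
    in enough detail in the abstract to reproduce.
    """
    fused: List[float] = []
    for level in ("event", "set", "element"):
        fused.extend(features[level])
    return fused
```

With a 64-frame clip, `build_pathway_inputs(64)` gives the element pathway four times as many frames as the event pathway, which mirrors the coarse-to-fine temporal resolution described above.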