In recent years, multi-label, multi-class video action recognition has gained significant popularity. While reasoning over temporally connected atomic actions is routine for intelligent species, standard artificial neural networks (ANNs) still struggle to classify them. In the real world, atomic actions often connect temporally to form more complex composite actions. The challenge lies in recognising composite actions of varying durations while other distinct composite or atomic actions occur in the background. Drawing on the success of relational networks, we propose methods that learn to reason over the semantic concepts of objects and actions. We empirically show how ANNs benefit from pretraining, relational inductive biases, and unordered set-based latent representations. In this paper we propose deep set conditioned I3D (SCI3D), a two-stream relational network that combines a latent representation of state with a visual representation to reason over events and actions. The two streams learn to reason about temporally connected actions in order to identify all of them in a video. The proposed method achieves an improvement of around 1.49% mAP in atomic action recognition and 17.57% mAP in composite action recognition over an I3D-NL baseline on the CATER dataset.
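To make the two-stream, set-conditioned idea concrete, the following is a minimal PyTorch sketch. It is an illustrative assumption, not the paper's architecture: the `TwoStreamSetModel`, its layer sizes, and the way the latent set is produced are all hypothetical stand-ins (a tiny 3D CNN replaces the pretrained I3D backbone), while the relation-network head follows the standard pairwise formulation and the sigmoid outputs reflect the multi-label setting.

```python
# Hypothetical sketch of a two-stream set-conditioned relational model for
# multi-label action recognition. Module names and sizes are illustrative
# assumptions; the paper builds on a pretrained I3D backbone instead of the
# tiny 3D CNN used here.
import torch
import torch.nn as nn


class RelationHead(nn.Module):
    """Relation-network head: scores all pairs of set elements and sums them,
    which makes the result invariant to the ordering of the latent set."""

    def __init__(self, dim: int, hidden: int, num_classes: int):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(2 * dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, hidden), nn.ReLU())
        self.f = nn.Linear(hidden, num_classes)

    def forward(self, objects: torch.Tensor) -> torch.Tensor:
        # objects: (batch, n, dim) -- an unordered set of latent vectors
        b, n, d = objects.shape
        left = objects.unsqueeze(2).expand(b, n, n, d)
        right = objects.unsqueeze(1).expand(b, n, n, d)
        pairs = torch.cat([left, right], dim=-1)        # (b, n, n, 2*dim)
        relations = self.g(pairs).sum(dim=(1, 2))       # permutation-invariant
        return self.f(relations)                        # multi-label logits


class TwoStreamSetModel(nn.Module):
    def __init__(self, num_classes: int = 14, set_size: int = 8, dim: int = 128):
        super().__init__()
        # Visual stream: a tiny 3D CNN standing in for the pretrained I3D.
        self.visual = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(64, dim))
        # State stream: projects the clip feature into an unordered set of
        # latent state vectors (a simplification of the paper's state stream).
        self.to_set = nn.Linear(dim, set_size * dim)
        self.set_size, self.dim = set_size, dim
        self.head = RelationHead(dim, hidden=256, num_classes=num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, 3, frames, height, width)
        visual = self.visual(clip)                                   # (b, dim)
        latents = self.to_set(visual).view(-1, self.set_size, self.dim)
        return self.head(latents)                                    # (b, classes)


if __name__ == "__main__":
    model = TwoStreamSetModel()
    logits = model(torch.randn(2, 3, 16, 64, 64))
    probs = torch.sigmoid(logits)   # independent sigmoids -> multi-label output
    print(probs.shape)              # torch.Size([2, 14])
```

Summing `g` over all ordered pairs before the final classifier is what gives the head its relational inductive bias and its invariance to the set ordering, matching the abstract's emphasis on unordered set-based latent representations.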