In this paper, we propose an approach that spatially localizes activities in a video frame, where each person can perform multiple activities at the same time. Our approach takes into account the temporal scene context as well as the relations between the actions of detected persons. While the temporal context is modeled by a temporal recurrent neural network (RNN), the relations between actions are modeled by a graph RNN. Both networks are trained jointly, and the proposed approach achieves state-of-the-art results on the AVA dataset.
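To make the architecture concrete, the following is a minimal sketch (not the authors' code) of how a temporal RNN over per-frame scene features can be combined with a graph RNN over detected persons for multi-label action classification. All module names, feature dimensions, the number of graph steps, and the simple mean-aggregation message passing are illustrative assumptions, written in PyTorch.

```python
# Illustrative sketch: temporal RNN for scene context + graph RNN over
# person nodes, trained jointly for multi-label action classification.
# Names, sizes, and the mean-message aggregation are assumptions, not
# the paper's exact architecture.
import torch
import torch.nn as nn


class TemporalGraphActionModel(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=256, num_actions=80):
        super().__init__()
        # Temporal RNN: models scene context across frames.
        self.temporal_rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        # Graph RNN: each step exchanges messages between person nodes,
        # then updates every node state with a shared GRU cell.
        self.message = nn.Linear(hidden_dim, hidden_dim)
        self.node_rnn = nn.GRUCell(hidden_dim, hidden_dim)
        self.person_proj = nn.Linear(feat_dim, hidden_dim)
        # Multi-label head: a person can perform several actions at once,
        # so independent sigmoids (via BCE loss) replace a softmax.
        self.classifier = nn.Linear(2 * hidden_dim, num_actions)

    def forward(self, frame_feats, person_feats, num_graph_steps=2):
        # frame_feats:  (B, T, feat_dim)  per-frame scene features
        # person_feats: (B, N, feat_dim)  per-person RoI features
        _, scene_ctx = self.temporal_rnn(frame_feats)   # (1, B, hidden)
        scene_ctx = scene_ctx.squeeze(0)                # (B, hidden)

        h = torch.tanh(self.person_proj(person_feats))  # (B, N, hidden)
        B, N, H = h.shape
        for _ in range(num_graph_steps):
            # Fully connected graph: each node receives the mean message
            # from all nodes (a stand-in for learned relation weights).
            msg = self.message(h).mean(dim=1, keepdim=True).expand_as(h)
            h = self.node_rnn(
                msg.reshape(B * N, H), h.reshape(B * N, H)
            ).reshape(B, N, H)

        # Concatenate scene context to every person state and classify.
        ctx = scene_ctx.unsqueeze(1).expand(-1, N, -1)
        return self.classifier(torch.cat([h, ctx], dim=-1))  # (B, N, A)


if __name__ == "__main__":
    model = TemporalGraphActionModel()
    frames = torch.randn(2, 16, 512)    # 2 clips, 16 frames each
    persons = torch.randn(2, 3, 512)    # 3 detected persons per clip
    print(model(frames, persons).shape)  # torch.Size([2, 3, 80])
```

Because the logits feed a per-action sigmoid (e.g. `nn.BCEWithLogitsLoss`), a single backward pass trains the temporal and graph RNNs together, matching the joint training described above.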