Detecting and recognizing human actions in videos of crowded scenes is a challenging problem due to complex environments and diverse events. Prior works typically fall short in two respects: (1) they fail to exploit scene information; (2) they lack training data for crowded and complex scenes. In this paper, we focus on improving spatio-temporal action recognition by fully utilizing scene information and collecting new data. A top-down strategy is used to overcome these limitations. Specifically, we adopt a strong human detector to localize each person in every frame. We then apply action recognition models to learn spatio-temporal information from video frames, training on both the HIE dataset and new data with diverse scenes collected from the internet, which improves the generalization ability of our model. In addition, scene information extracted by a semantic segmentation model assists the recognition process. As a result, our method achieves an average of 26.05 wf\_mAP, ranking 1st place in the ACM MM grand challenge 2020: Human in Events.
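To make the top-down strategy concrete, the sketch below wires the three stages together with off-the-shelf torchvision models. It is a minimal illustration under stated assumptions, not the authors' implementation: the detector (Faster R-CNN), action classifier (R3D-18), and segmentation model (DeepLabV3) are stand-ins for the stronger models the paper uses, and the fixed-box tube cropping and the way scene context is consumed are hypothetical simplifications.

```python
# Hedged sketch of the top-down pipeline: detect people per frame, crop a
# spatio-temporal tube per person, classify it with a 3D CNN, and extract
# scene context with semantic segmentation. All model choices are
# illustrative stand-ins, not the paper's actual models.
import torch
import torch.nn.functional as F
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.segmentation import deeplabv3_resnet50
from torchvision.models.video import r3d_18

detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
segmenter = deeplabv3_resnet50(weights="DEFAULT").eval()
action_net = r3d_18(weights="DEFAULT").eval()

PERSON = 1  # COCO category id for "person"

@torch.no_grad()
def detect_people(frame, score_thresh=0.8):
    """Stage 1 (top-down): detect human boxes on one frame (3, H, W)."""
    out = detector([frame])[0]
    keep = (out["labels"] == PERSON) & (out["scores"] > score_thresh)
    return out["boxes"][keep]

@torch.no_grad()
def classify_action(clip, box, size=112):
    """Stage 2: crop one person tube from the clip (T, 3, H, W) using the
    key-frame box (x1, y1, x2, y2), then score it with a 3D CNN."""
    x1, y1, x2, y2 = map(int, box.tolist())
    tube = clip[:, :, y1:y2, x1:x2]                # same box on every frame
    tube = F.interpolate(tube, size=(size, size))  # (T, 3, size, size)
    tube = tube.permute(1, 0, 2, 3).unsqueeze(0)   # (1, 3, T, size, size)
    return action_net(tube).softmax(dim=-1)        # per-class action scores

@torch.no_grad()
def scene_context(frame):
    """Auxiliary branch: a per-pixel semantic map of the key frame, used
    as scene information to assist the predicted action labels."""
    return segmenter(frame.unsqueeze(0))["out"].argmax(dim=1)  # (1, H, W)

# Usage on a dummy 16-frame clip: detect on the key frame, score each tube.
clip = torch.rand(16, 3, 480, 640)
boxes = detect_people(clip[8])
scores = [classify_action(clip, b) for b in boxes]
scene = scene_context(clip[8])
```

In a real system the per-person action scores would be fused with the scene map (e.g. reweighting actions that are implausible for the surrounding scene); that fusion step is deliberately left abstract here.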