This paper studies how to introduce viewpoint-invariant feature representations for action recognition and detection. Although action recognition has made great progress over the past decade, efficiently modeling the geometric variations in large-scale datasets remains challenging. We propose a novel Spatial-Temporal Alignment Network (STAN) that learns geometry-invariant representations for action recognition and action detection. The STAN model is lightweight and generic: it can be plugged into existing action recognition models such as ResNet3D and SlowFast at very low extra computational cost. We evaluate STAN extensively on the AVA, Kinetics-400, AVA-Kinetics, Charades, and Charades-Ego datasets. Experimental results show that STAN consistently improves over the state of the art on both action detection and action recognition tasks. We will release our data, models, and code.
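To make the "pluggable alignment module" idea concrete, below is a minimal PyTorch sketch of one plausible design: a spatial-transformer-style block that predicts an affine transform from pooled clip features and warps every frame's feature map accordingly. The `AlignmentModule` name, the per-clip affine parameterization, and the insertion point are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentModule(nn.Module):
    """Hypothetical STAN-style alignment block (a sketch, not the
    paper's exact design): predict a 2x3 affine transform from pooled
    clip features, then warp each frame's feature map with it."""

    def __init__(self, channels):
        super().__init__()
        self.loc = nn.Linear(channels, 6)  # regresses affine theta
        # Initialize to the identity transform so training starts stable.
        nn.init.zeros_(self.loc.weight)
        self.loc.bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):                        # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        pooled = x.mean(dim=(2, 3, 4))           # (B, C) global context
        theta = self.loc(pooled).view(b, 2, 3)   # (B, 2, 3) affine matrix
        # Fold time into the batch so all frames share the clip transform.
        frames = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        theta = theta.repeat_interleave(t, dim=0)
        grid = F.affine_grid(theta, frames.shape, align_corners=False)
        warped = F.grid_sample(frames, grid, align_corners=False)
        return warped.reshape(b, t, c, h, w).permute(0, 2, 1, 3, 4)

# Hypothetical usage between backbone stages of a 3D CNN:
#   feats = backbone_stem(video)              # (B, C, T, H, W)
#   feats = AlignmentModule(feats.shape[1])(feats)
#   feats = backbone_stages(feats)
```

Because the module only adds a small linear head plus a differentiable warp, it keeps the backbone unchanged and adds little compute, consistent with the low-overhead claim above.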