While video action recognition has been an active area of research for several years, zero-shot action recognition has only recently started gaining traction. In this work, we propose a novel end-to-end trained transformer model that captures long-range spatiotemporal dependencies efficiently, in contrast to existing approaches that rely on 3D-CNNs. Moreover, to address a common ambiguity in existing works about which classes can be considered previously unseen, we propose a new experimentation setup that satisfies the zero-shot learning premise for action recognition by avoiding overlap between the training and testing classes. The proposed approach significantly outperforms the state of the art in zero-shot action recognition in terms of top-1 accuracy on the UCF-101, HMDB-51, and ActivityNet datasets. The code and proposed experimentation setup are available on GitHub: https://github.com/Secure-and-Intelligent-Systems-Lab/SemanticVideoTransformer