Zero-Shot Action Recognition (ZSAR) aims to recognize video actions that have never been seen during training. Most existing methods assume a shared semantic space between seen and unseen actions and attempt to learn a direct mapping from the visual space to the semantic space. This approach is challenged by the semantic gap between the two spaces. This paper presents a novel method that uses object semantics as privileged information to narrow the semantic gap and thereby effectively assist the learning. In particular, a simple hallucination network is proposed to implicitly extract object semantics during testing without explicitly detecting objects, and a cross-attention module is developed to augment the visual features with the object semantics. Experiments on the Olympic Sports, HMDB51, and UCF101 datasets show that the proposed method outperforms the state-of-the-art methods by a large margin.
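The abstract does not give architectural details, but the core fusion idea can be sketched. Below is a minimal, illustrative NumPy implementation of how a cross-attention module might augment visual features with object-semantic embeddings: visual features act as queries and object semantics as keys/values, with a residual connection. The single-head dot-product formulation, function names, and tensor shapes are assumptions, not the paper's specification.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_augment(visual, obj_sem, Wq, Wk, Wv):
    """Fuse object semantics into visual features via cross-attention.

    visual:  (T, d) per-frame visual features (queries)
    obj_sem: (K, d) object-semantic embeddings (keys/values)
    Returns augmented visual features of shape (T, d).
    """
    Q = visual @ Wq                              # (T, d)
    K = obj_sem @ Wk                             # (K, d)
    V = obj_sem @ Wv                             # (K, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])      # (T, K) scaled dot product
    attn = softmax(scores, axis=-1)              # attend over object semantics
    return visual + attn @ V                     # residual augmentation

# Toy example with random features (shapes are illustrative)
rng = np.random.default_rng(0)
d, T, K = 8, 4, 5
visual = rng.standard_normal((T, d))
obj_sem = rng.standard_normal((K, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out = cross_attention_augment(visual, obj_sem, Wq, Wk, Wv)
print(out.shape)  # augmented features keep the visual shape: (4, 8)
```

In a ZSAR pipeline, the augmented features would then be mapped to the shared semantic space for matching against unseen class embeddings.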