We present a cross-modal Transformer-based framework that jointly encodes video data and text labels for zero-shot action recognition (ZSAR). Our model employs a conceptually new pipeline in which visual representations are learned jointly with visual-semantic associations in an end-to-end manner. The design provides a natural mechanism for visual and semantic representations to be learned in a shared knowledge space, which encourages the learned visual embeddings to be both discriminative and semantically consistent. For zero-shot inference, we devise a simple semantic transfer scheme that embeds semantic relatedness between seen and unseen classes to compose unseen visual prototypes. Accordingly, the discriminative structure of the visual space can be preserved and exploited to alleviate typical zero-shot issues such as information loss, the semantic gap, and the hubness problem. Under a rigorous zero-shot setting with no pre-training on additional datasets, our experiments show that the model considerably improves upon the state of the art in ZSAR, reaching encouraging top-1 accuracy on the UCF101, HMDB51, and ActivityNet benchmark datasets. Code will be made available.
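The semantic transfer step can be pictured with a minimal sketch. It assumes the composited unseen prototypes are similarity-weighted combinations of seen-class visual prototypes, with label-embedding cosine similarity as the relatedness measure; the function names, temperature, and weighting scheme below are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def composite_unseen_prototypes(seen_prototypes, seen_label_emb,
                                unseen_label_emb, temperature=0.1):
    """Hypothetical sketch: build unseen-class visual prototypes as a
    similarity-weighted combination of seen-class visual prototypes.

    seen_prototypes:  (S, D) visual prototypes of seen classes
    seen_label_emb:   (S, E) text-label embeddings of seen classes
    unseen_label_emb: (U, E) text-label embeddings of unseen classes
    returns:          (U, D) composited unseen visual prototypes
    """
    # Cosine similarity between unseen and seen label embeddings: (U, S).
    sim = F.normalize(unseen_label_emb, dim=-1) @ F.normalize(seen_label_emb, dim=-1).T
    # Turn semantic relatedness into mixing weights over seen classes.
    weights = F.softmax(sim / temperature, dim=-1)
    # Composite unseen prototypes in the shared visual space.
    return weights @ seen_prototypes

def zero_shot_classify(video_embedding, unseen_prototypes):
    """Assign each video embedding to its nearest unseen prototype (cosine)."""
    sim = F.normalize(video_embedding, dim=-1) @ F.normalize(unseen_prototypes, dim=-1).T
    return sim.argmax(dim=-1)
```

Because classification is carried out against prototypes that live in the learned visual space, rather than by projecting visual features into a separate semantic space, the discriminative structure of the visual embedding is retained at inference time.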