Audio-visual generalised zero-shot learning for video classification requires understanding the relations between the audio and visual information in order to recognise samples from novel, previously unseen classes at test time. The natural semantic and temporal alignment between audio and visual data in videos can be exploited to learn powerful representations that generalise to unseen classes at test time. We propose a multi-modal and Temporal Cross-attention Framework (\modelName) for audio-visual generalised zero-shot learning. Its inputs are temporally aligned audio and visual features obtained from pre-trained networks. Encouraging the framework to focus on cross-modal correspondence across time, instead of self-attention within each modality, significantly boosts performance. We show that our proposed framework, which ingests temporal features, yields state-of-the-art performance on the \ucf, \vgg, and \activity benchmarks for (generalised) zero-shot learning. Code for reproducing all results is available at \url{https://github.com/ExplainableML/TCAF-GZSL}.
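To illustrate the cross-modal attention idea described above, the following is a minimal sketch (assuming a PyTorch implementation; the class name, feature dimensions, and layer choices are illustrative assumptions, not the authors' released code) in which temporally aligned audio tokens attend to visual tokens and vice versa, rather than each modality attending to itself.
\begin{verbatim}
# Minimal sketch of cross-modal temporal attention (hypothetical names
# and dimensions; not the authors' released implementation).
import torch
import torch.nn as nn

class CrossModalAttentionBlock(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        # Audio queries attend to visual keys/values, and vice versa.
        self.audio_to_visual = nn.MultiheadAttention(
            dim, num_heads, batch_first=True)
        self.visual_to_audio = nn.MultiheadAttention(
            dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio, visual):
        # audio, visual: (batch, time, dim) temporally aligned features
        # from pre-trained audio and visual backbones.
        a_attn, _ = self.audio_to_visual(query=audio, key=visual,
                                         value=visual)
        v_attn, _ = self.visual_to_audio(query=visual, key=audio,
                                         value=audio)
        # Residual connections followed by layer normalisation.
        return self.norm_a(audio + a_attn), self.norm_v(visual + v_attn)

# Example: a batch of 4 clips with 10 temporal segments of 512-d features.
audio = torch.randn(4, 10, 512)
visual = torch.randn(4, 10, 512)
block = CrossModalAttentionBlock()
audio_out, visual_out = block(audio, visual)
\end{verbatim}
In this sketch, each modality's queries are matched against the other modality's keys and values, so the learned attention weights encode audio-visual correspondence across time rather than within-modality self-similarity.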