Few-shot action recognition has attracted increasing attention due to the difficulty of acquiring properly labelled training samples. Recent works have shown that preserving spatial information and comparing video descriptors are crucial for few-shot action recognition. However, the importance of preserving temporal information has not been well discussed. In this paper, we propose a Contents and Length-based Temporal Attention (CLTA) model, which learns customized temporal attention for each individual video to tackle the few-shot action recognition problem. CLTA uses the Gaussian likelihood function as a template to generate temporal attention, and trains learnable matrices to predict the mean and standard deviation from both the frame contents and the video length. We show that, with precisely captured temporal attention, even a backbone that is not fine-tuned, paired with an ordinary softmax classifier, can still achieve results similar to or better than the state of the art in few-shot action recognition.
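For concreteness, below is a minimal PyTorch sketch of the mechanism the abstract describes: a Gaussian template over frame indices whose mean and standard deviation are predicted by learned matrices from pooled frame contents and the clip length. The module name, layer sizes, and the sigmoid/softplus parameterizations are our own assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class GaussianTemporalAttention(nn.Module):
    """Sketch of contents- and length-based temporal attention (CLTA-style).

    All layer names and dimensions here are illustrative assumptions.
    """

    def __init__(self, feat_dim: int = 2048):
        super().__init__()
        # Learned matrices mapping pooled frame contents (plus the
        # normalized clip length) to the Gaussian's mean and std.
        self.mu_head = nn.Linear(feat_dim + 1, 1)
        self.sigma_head = nn.Linear(feat_dim + 1, 1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, D) per-frame features from a (possibly frozen) backbone.
        B, T, D = frames.shape
        length = frames.new_full((B, 1), float(T))        # clip length input
        content = frames.mean(dim=1)                      # (B, D) content summary
        x = torch.cat([content, length / 100.0], dim=1)   # contents + length
        mu = torch.sigmoid(self.mu_head(x)) * (T - 1)     # mean in [0, T-1]
        sigma = nn.functional.softplus(self.sigma_head(x)) + 1e-3  # std > 0
        t = torch.arange(T, device=frames.device).float().unsqueeze(0)  # (1, T)
        # Gaussian likelihood template over frame indices, one curve per video.
        attn = torch.exp(-0.5 * ((t - mu) / sigma) ** 2)
        attn = attn / attn.sum(dim=1, keepdim=True)       # normalize over time
        return (attn.unsqueeze(-1) * frames).sum(dim=1)   # attention-weighted pooling
```

Because the mean and standard deviation are regressed per video, each clip receives its own attention curve; the pooled output can then be fed to an ordinary softmax classifier, as the abstract suggests.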