Few-shot video classification aims to learn a classification model with good generalization ability from only a few labeled videos. However, it is difficult to learn discriminative feature representations for videos in such a setting. In this paper, we propose Temporal Alignment Prediction (TAP), a sequence similarity learning method for few-shot video classification. To obtain the similarity of a pair of videos, TAP predicts the alignment scores between all pairs of temporal positions in the two videos with a temporal alignment prediction function. In addition, the inputs to this function are enriched with context information from the temporal domain. We evaluate TAP on two video classification benchmarks, Kinetics and Something-Something V2. The experimental results verify the effectiveness of TAP and show its superiority over state-of-the-art methods.
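The core idea of scoring a pair of videos through pairwise temporal alignment can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's actual model: the function `tap_similarity`, the dot-product alignment scores, and the softmax aggregation are all assumptions for illustration, and the temporal context enrichment step is omitted.

```python
import numpy as np

def tap_similarity(x, y, temperature=1.0):
    """Toy sketch of sequence similarity via temporal alignment.

    x: (T1, D) frame features of video 1
    y: (T2, D) frame features of video 2
    Returns a scalar video-level similarity score.
    """
    # Alignment scores between all pairs of temporal positions (T1, T2).
    # The real method learns a prediction function; a dot product is an
    # assumed stand-in here.
    scores = x @ y.T
    # Soft alignment: each position in x distributes weight over y.
    weights = np.exp(scores / temperature)
    weights /= weights.sum(axis=1, keepdims=True)
    # Aggregate the aligned pairwise scores into one similarity value.
    return float((weights * scores).sum() / x.shape[0])

rng = np.random.default_rng(0)
a = rng.standard_normal((8, 16))   # 8 temporal positions, 16-d features
b = rng.standard_normal((10, 16))  # a different video, 10 positions
```

In a few-shot setting, such a similarity would let a query video be classified by comparing it against the few labeled support videos of each class.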