Few-shot learning aims to recognize novel classes from only a few examples. Although significant progress has been made in the image domain, few-shot video classification remains relatively unexplored. We argue that previous methods underestimate the importance of video feature learning and propose to learn spatiotemporal features using a 3D CNN. We propose a two-stage approach that first learns video features on base classes and then fine-tunes classifiers on novel classes, and show that this simple baseline outperforms prior few-shot video classification methods by over 20 points on existing benchmarks. To circumvent the need for labeled examples, we present two novel approaches that yield further improvements. First, we leverage tag-labeled videos from a large dataset via tag retrieval, followed by selecting the best clips based on visual similarity. Second, we learn generative adversarial networks that generate video features of novel classes from their semantic embeddings. Moreover, we find that existing benchmarks are limited because they consider only 5 novel classes in each testing episode, and we introduce more realistic benchmarks involving more novel classes, i.e., few-shot learning, as well as a mixture of novel and base classes, i.e., generalized few-shot learning. The experimental results show that our retrieval and feature generation approaches significantly outperform the baseline on the new benchmarks.
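The two-stage baseline can be sketched as follows: pretrain a 3D CNN as an ordinary classifier on the base classes, then freeze it and fit only a linear classifier on the few labeled clips of the novel classes. This is a minimal illustration assuming torchvision's r3d_18 as the backbone; the loaders, class counts, and hyperparameters are hypothetical placeholders, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

# Synthetic stand-ins for real data loaders (shapes only, for illustration):
# clips are (batch, channels, frames, height, width).
base_loader = [(torch.randn(2, 3, 8, 112, 112), torch.randint(0, 64, (2,)))]
novel_support_loader = [(torch.randn(2, 3, 8, 112, 112), torch.randint(0, 5, (2,)))]

criterion = nn.CrossEntropyLoss()

# Stage 1: representation learning -- train the 3D CNN end-to-end as a
# standard classifier over the base classes.
backbone = r3d_18(num_classes=64)                 # e.g. 64 base classes
opt = torch.optim.SGD(backbone.parameters(), lr=0.01, momentum=0.9)
for clips, labels in base_loader:
    opt.zero_grad()
    criterion(backbone(clips), labels).backward()
    opt.step()

# Stage 2: freeze the pretrained backbone and fine-tune only a new linear
# classifier on the few labeled clips of the novel classes.
backbone.fc = nn.Identity()                       # expose 512-d features
for p in backbone.parameters():
    p.requires_grad = False

novel_head = nn.Linear(512, 5)                    # e.g. a 5-way episode
head_opt = torch.optim.SGD(novel_head.parameters(), lr=0.01)
for clips, labels in novel_support_loader:
    with torch.no_grad():
        feats = backbone(clips)                   # frozen feature extractor
    head_opt.zero_grad()
    criterion(novel_head(feats), labels).backward()
    head_opt.step()
```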
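The feature-generation idea can likewise be sketched as a conditional GAN that operates on pre-extracted 3D CNN features rather than raw video: once trained on base classes, the generator synthesizes arbitrarily many features for a novel class from its semantic embedding alone, and a classifier is then fit on the synthetic (plus any real few-shot) features. The dimensions and layer sizes below are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

# Assumed dimensions: 512-d video features, 300-d semantic class embeddings
# (e.g. word vectors), 100-d noise. All illustrative.
FEAT_DIM, EMB_DIM, NOISE_DIM = 512, 300, 100

class Generator(nn.Module):
    """Maps (noise, class embedding) -> synthetic video feature."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM + EMB_DIM, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, FEAT_DIM), nn.ReLU())

    def forward(self, z, emb):
        return self.net(torch.cat([z, emb], dim=1))

class Discriminator(nn.Module):
    """Scores whether a (feature, class embedding) pair looks real."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEAT_DIM + EMB_DIM, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, 1))

    def forward(self, feat, emb):
        return self.net(torch.cat([feat, emb], dim=1))

# After adversarial training on base classes, synthesize features for a
# novel class from its embedding (a random placeholder here), then train a
# classifier on them as in stage 2 above.
G = Generator()
novel_emb = torch.randn(1, EMB_DIM).expand(100, EMB_DIM)  # placeholder
fake_feats = G(torch.randn(100, NOISE_DIM), novel_emb)    # (100, 512)
```

Generating features instead of raw videos keeps the GAN small and avoids the difficulty of synthesizing realistic clips, which is what makes this practical in the few-shot regime.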