We present MetaUVFS, the first unsupervised meta-learning algorithm for video few-shot action recognition. MetaUVFS leverages over 550K unlabeled videos to train a two-stream 2D and 3D CNN architecture via contrastive learning, capturing appearance-specific spatial features and action-specific spatio-temporal features respectively. MetaUVFS comprises a novel Action-Appearance Aligned Meta-adaptation (A3M) module that learns to focus on action-oriented video features in relation to appearance features, via explicit few-shot episodic meta-learning over unsupervised hard-mined episodes. The action-appearance alignment and the explicit few-shot learner condition the unsupervised training to mimic the downstream few-shot task, enabling MetaUVFS to significantly outperform all unsupervised methods on few-shot benchmarks. Moreover, unlike previous few-shot action recognition methods, which are supervised, MetaUVFS needs neither base-class labels nor a supervised pretrained backbone. Thus, MetaUVFS needs to be trained just once, yet performs competitively with, and sometimes even outperforms, state-of-the-art supervised methods on the popular HMDB51, UCF101, and Kinetics100 few-shot datasets.
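To make the episodic setup concrete, the following is a minimal sketch, not the paper's implementation: a toy stand-in for A3M-style action-appearance alignment (a softmax-similarity attention over clip features) followed by nearest-prototype classification inside one N-way K-shot episode. All function names, dimensions, and the synthetic features are hypothetical illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)


def align(action, appearance):
    """Toy stand-in for the A3M module: re-weight each clip's appearance
    feature by its softmax similarity to the action features, then
    concatenate the aligned appearance with the action feature."""
    sim = action @ appearance.T                       # (n, n) similarities
    w = np.exp(sim - sim.max(axis=1, keepdims=True))  # numerically stable softmax
    w /= w.sum(axis=1, keepdims=True)
    return np.concatenate([action, w @ appearance], axis=1)


def episode_accuracy(support, support_y, query, query_y):
    """Nearest-prototype classification within one N-way K-shot episode."""
    classes = np.unique(support_y)
    protos = np.stack([support[support_y == c].mean(axis=0) for c in classes])
    dists = ((query[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return (classes[dists.argmin(axis=1)] == query_y).mean()


def synthetic_episode(n_way=5, k_shot=1, n_query=3, dim=16):
    """Well-separated synthetic 'video' features for a sanity check,
    standing in for the 2D (appearance) and 3D (action) stream outputs."""
    means = 10.0 * np.eye(n_way, dim)
    ys = np.repeat(np.arange(n_way), k_shot)
    yq = np.repeat(np.arange(n_way), n_query)
    action_s = means[ys] + 0.1 * rng.standard_normal((len(ys), dim))
    app_s = means[ys] + 0.1 * rng.standard_normal((len(ys), dim))
    action_q = means[yq] + 0.1 * rng.standard_normal((len(yq), dim))
    app_q = means[yq] + 0.1 * rng.standard_normal((len(yq), dim))
    return align(action_s, app_s), ys, align(action_q, app_q), yq
```

On such a well-separated 5-way 1-shot episode, the aligned prototypes classify every query correctly; in the actual method the alignment module and backbones are of course learned over hard-mined unsupervised episodes rather than fixed.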