HyRSM++: Hybrid Relation Guided Temporal Set Matching for Few-shot Action Recognition

From arXiv. An extended version of the paper arXiv:2204.13423, published in CVPR 2022. This work has been submitted to Springer for possible publication.

Recent attempts at few-shot action recognition mainly focus on learning a deep representation for each video individually under the episodic meta-learning regime, and then performing temporal alignment to match query and support videos. However, they still suffer from two drawbacks: (i) learning individual features without considering the entire task may limit representation capability, and (ii) existing alignment strategies are sensitive to noise and misaligned instances. To address these two limitations, we propose a novel Hybrid Relation guided temporal Set Matching (HyRSM++) approach for few-shot action recognition. The core idea of HyRSM++ is to integrate all videos within the task to learn discriminative representations and to employ a robust matching technique. Specifically, HyRSM++ consists of two key components: a hybrid relation module and a temporal set matching metric. Given the basic representations from the feature extractor, the hybrid relation module fully exploits the relations within and across videos in an episodic task, and can thus learn task-specific embeddings. Subsequently, in the temporal set matching metric, we measure the distance between query and support videos from a set matching perspective and design a bidirectional Mean Hausdorff Metric (Bi-MHM) to improve resilience to misaligned instances. In addition, we explicitly exploit the temporal coherence in videos to regularize the matching process. Furthermore, we extend HyRSM++ to the more challenging semi-supervised and unsupervised few-shot action recognition tasks. Experimental results on multiple benchmarks demonstrate that our method achieves state-of-the-art performance under various few-shot settings. The source code is available at https://github.com/alibaba-mmai-research/HyRSMPlusPlus.
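
To make the set matching idea concrete, here is a minimal sketch of a bidirectional mean Hausdorff-style distance between two videos, each treated as an unordered set of frame feature vectors. It is an illustration under stated assumptions, not the authors' implementation: the Euclidean frame distance, the function names, and the toy 2-D features are all hypothetical, and the temporal coherence regularization and hybrid relation module are left out.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two frame feature vectors (an assumed choice)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def directed_mean_min(src, dst):
    """Average, over frames in src, of the distance to the closest frame in dst."""
    return sum(min(frame_distance(a, b) for b in dst) for a in src) / len(src)


def bidirectional_mean_hausdorff(query, support):
    """Sum of both directed terms, so every frame of both videos is matched."""
    return directed_mean_min(query, support) + directed_mean_min(support, query)


def classify(query, support_set):
    """Return the label of the support video with the smallest set distance."""
    return min(
        support_set,
        key=lambda label: bidirectional_mean_hausdorff(query, support_set[label]),
    )


# Toy episode: hypothetical 2-D frame features, one support video per class.
query = [[0.0, 0.0], [1.0, 0.0]]
support_set = {
    "A": [[1.0, 0.0], [0.0, 0.0]],  # same frames as the query, in a different order
    "B": [[5.0, 5.0], [6.0, 5.0]],
}
predicted = classify(query, support_set)
```

Because each frame is matched to its nearest counterpart regardless of position, reordering the frames of a video leaves the distance unchanged. That is the sense in which a set-based metric tolerates temporal misalignment that a strict frame-by-frame alignment would penalize.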
