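IEEE Transactions on Games (T-G) publishes high-quality original articles on the scientific, technical, and engineering aspects of games. Articles in this journal are peer reviewed in accordance with the requirements of the IEEE PSPB Operations Manual (sections 8.2.1.C and 8.2.2.A). Every published article is reviewed by at least two independent reviewers through a single-blind peer review process: the reviewers' identities are not known to the authors, but the reviewers know the authors' identities. Articles are screened for plagiarism before acceptance. Official website: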


Semi-supervised video action recognition can enable deep neural networks to achieve remarkable performance even with very limited labeled data. However, existing methods are mainly adapted from image-based methods (e.g., FixMatch). Because they do not specifically exploit the temporal dynamics and inherent multimodal attributes of video, their results may be suboptimal. To better leverage the temporal information encoded in videos, this paper introduces temporal gradient as an additional modality for more attentive feature extraction. Specifically, our method explicitly distills fine-grained motion representations from temporal gradient (TG) and imposes consistency across the two modalities (i.e., RGB and TG). The performance of semi-supervised action recognition is significantly improved without additional computation or parameters during inference. Our method achieves state-of-the-art performance on three video action recognition benchmarks (i.e., Kinetics-400, UCF-101, and HMDB-51) under several typical semi-supervised settings (i.e., different ratios of labeled data).
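
To make the two ideas in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that temporal gradient is computed as the difference between consecutive RGB frames, and it stands in a plain mean squared error for the cross-modal consistency term, since the abstract does not specify the loss. The function names `temporal_gradient` and `consistency_loss` and the `(T, H, W, C)` clip layout are illustrative assumptions.

```python
import numpy as np


def temporal_gradient(clip: np.ndarray) -> np.ndarray:
    """Return frame-to-frame differences of a video clip.

    Assumption: `clip` has shape (T, H, W, C). The result has shape
    (T - 1, H, W, C), where entry t is frame t+1 minus frame t.
    """
    # Cast first so that uint8 frames do not wrap around on subtraction.
    clip = clip.astype(np.float32)
    return clip[1:] - clip[:-1]


def consistency_loss(feat_rgb: np.ndarray, feat_tg: np.ndarray) -> float:
    """Mean squared error between RGB and TG features of the same shape.

    Assumption: MSE is used here only as a simple stand-in for a
    cross-modal consistency objective.
    """
    diff = feat_rgb.astype(np.float32) - feat_tg.astype(np.float32)
    return float(np.mean(diff ** 2))


if __name__ == "__main__":
    # Toy clip: 3 frames of 2x2 pixels with 1 channel.
    clip = np.arange(12, dtype=np.uint8).reshape(3, 2, 2, 1)
    tg = temporal_gradient(clip)
    print(tg.shape)
    print(consistency_loss(np.zeros(2), np.array([2.0, 2.0])))
```

In a setup like the one the abstract describes, the TG branch would only be needed during training, which is consistent with the claim that inference incurs no additional computation or parameters.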