Semi-supervised learning for video action recognition enables deep neural networks to achieve remarkable performance even with very limited labeled data. However, existing methods are largely adapted from image-based approaches (e.g., FixMatch). Without explicitly exploiting the temporal dynamics and inherent multimodal attributes of video, their results can be suboptimal. To better leverage the temporal information encoded in videos, in this paper we introduce temporal gradient as an additional modality for more attentive feature extraction. Specifically, our method explicitly distills fine-grained motion representations from the temporal gradient (TG) and imposes consistency across modalities (i.e., RGB and TG). This significantly improves semi-supervised action recognition without any additional computation or parameters at inference time. Our method achieves state-of-the-art performance on three video action recognition benchmarks (i.e., Kinetics-400, UCF-101, and HMDB-51) under several typical semi-supervised settings (i.e., different ratios of labeled data).
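As a rough illustration of the two core ingredients named above, the sketch below computes the temporal gradient of a clip as frame-wise differences and applies a simple cross-modal consistency objective between RGB and TG feature embeddings. This is a minimal PyTorch sketch under assumptions, not the paper's implementation: the negative-cosine-similarity loss and all names (temporal_gradient, cross_modal_consistency, the feature tensors) are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

def temporal_gradient(clip: torch.Tensor) -> torch.Tensor:
    """Temporal gradient (TG) of a clip as frame-wise differences.

    clip: (B, C, T, H, W) RGB video tensor.
    Returns a (B, C, T-1, H, W) TG tensor.
    """
    return clip[:, :, 1:] - clip[:, :, :-1]

def cross_modal_consistency(rgb_feat: torch.Tensor,
                            tg_feat: torch.Tensor) -> torch.Tensor:
    """One plausible consistency loss between RGB and TG embeddings:
    negative cosine similarity (an assumption, not the paper's exact loss)."""
    rgb = F.normalize(rgb_feat, dim=-1)
    tg = F.normalize(tg_feat, dim=-1)
    return -(rgb * tg).sum(dim=-1).mean()

# Usage: a batch of 2 clips, 8 RGB frames at 112x112.
clip = torch.randn(2, 3, 8, 112, 112)
tg = temporal_gradient(clip)      # (2, 3, 7, 112, 112), fed to a TG encoder
rgb_feat = torch.randn(2, 512)    # stand-ins for encoder outputs
tg_feat = torch.randn(2, 512)
loss = cross_modal_consistency(rgb_feat, tg_feat)
```

Because the TG stream is only used to shape the RGB features during training, it can be dropped at inference, which is consistent with the claim of no extra computation or parameters at test time.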