Semi-supervised action recognition is a challenging but critical task due to the high cost of video annotations. Existing approaches mainly rely on convolutional neural networks, while the recent, highly successful vision transformers remain less explored. In this paper, we investigate the use of transformer models for action recognition under the semi-supervised learning (SSL) setting. To this end, we introduce SVFormer, which adopts a stable pseudo-labeling framework (i.e., EMA-Teacher) to cope with unlabeled video samples. While a wide range of data augmentations have proven effective for semi-supervised image classification, they generally yield limited gains for video recognition. We therefore introduce a novel augmentation strategy tailored for video data, Tube TokenMix, in which two video clips are mixed via a token mask that is kept consistent along the temporal axis. In addition, we propose a temporal warping augmentation to cover the complex temporal variations in videos, which stretches selected frames to various temporal durations within the clip. Extensive experiments on three datasets, Kinetics-400, UCF-101, and HMDB-51, verify the advantages of SVFormer. In particular, SVFormer outperforms the state of the art by 31.5% with fewer training epochs under the 1% labeling rate on Kinetics-400. Our method can hopefully serve as a strong benchmark and encourage future research on semi-supervised action recognition with transformer networks.
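To make the two augmentations concrete, below is a minimal PyTorch sketch based only on the description above; the function names, the patch size, and the frame-resampling scheme are illustrative assumptions, not the paper's actual implementation. Tube TokenMix draws one binary mask over the spatial token grid and reuses it for every frame (a "tube" along time), while temporal warping resamples frames non-uniformly so that some frames occupy longer durations in the clip.

```python
import torch

def tube_token_mix(clip_a, clip_b, patch_size=16, mix_ratio=0.5):
    """Sketch of Tube TokenMix (hypothetical implementation).

    Mixes two clips of shape (C, T, H, W) with a spatial patch mask
    shared across all frames, so masked tokens form tubes over time.
    Returns the mixed clip and lam, the fraction of tokens taken from
    clip_a (the corresponding mixed label would be
    lam * y_a + (1 - lam) * y_b).
    """
    C, T, H, W = clip_a.shape
    h, w = H // patch_size, W // patch_size  # spatial token grid
    # One binary decision per spatial token, identical for every frame.
    mask = (torch.rand(h, w) < mix_ratio).float()
    lam = mask.mean().item()
    # Upsample the token mask to pixel resolution, broadcast over C and T.
    mask = mask.repeat_interleave(patch_size, 0).repeat_interleave(patch_size, 1)
    mask = mask.view(1, 1, H, W)
    mixed = mask * clip_a + (1.0 - mask) * clip_b
    return mixed, lam

def temporal_warp(clip, num_out=None):
    """Sketch of the temporal warping augmentation (hypothetical).

    Resamples the T input frames with random, non-uniform durations,
    so selected frames are stretched over longer spans of the clip.
    """
    C, T, H, W = clip.shape
    num_out = num_out or T
    # Random per-frame duration weights -> non-uniform sampling grid.
    weights = torch.rand(T) + 0.1
    cdf = torch.cumsum(weights / weights.sum(), dim=0)
    targets = torch.linspace(0.0, 1.0, num_out)
    idx = torch.searchsorted(cdf, targets).clamp(max=T - 1)
    return clip[:, idx]
```

A design point worth noting: because the mix mask is constant over time, the mixed clip keeps coherent motion within each spatial tube, which is what distinguishes this scheme from applying an independent CutMix-style mask per frame.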