In video action recognition, transformers consistently reach state-of-the-art accuracy. However, many models are too heavyweight for the average researcher with limited hardware resources. In this work, we explore the limitations of video transformers for lightweight action recognition. We benchmark 13 video transformers and baselines across 3 large-scale datasets and 10 hardware devices. Our study is the first to evaluate the efficiency of action recognition models in depth across multiple devices and to train a wide range of video transformers under the same conditions. We categorize current methods into three classes and show that composite transformers that augment convolutional backbones are best suited to lightweight action recognition, despite falling short in accuracy. Meanwhile, attention-only models require stronger motion modeling capabilities, and stand-alone attention block models currently incur excessive latency overhead. Our experiments conclude that current video transformers are not yet capable of lightweight action recognition on par with traditional convolutional baselines, and that the aforementioned shortcomings need to be addressed to bridge this gap. Code to reproduce our experiments will be made publicly available.
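To make the efficiency evaluation concrete, the sketch below shows one way per-device inference latency could be measured for a video model in PyTorch. It is a minimal illustration under our own assumptions (a placeholder 3D-convolutional backbone, a single 16-frame clip, and hypothetical warmup/run counts), not the exact benchmarking protocol used in the paper.

```python
# Minimal latency-benchmark sketch (illustrative; not the paper's exact protocol).
# Assumes a PyTorch video model taking clips shaped (batch, channels, frames, H, W).
import time
import torch
import torch.nn as nn


def measure_latency(model, clip, device, warmup=10, runs=50):
    """Return mean forward-pass latency in milliseconds on the given device."""
    model = model.to(device).eval()
    clip = clip.to(device)
    with torch.no_grad():
        for _ in range(warmup):              # warm up kernels and caches
            model(clip)
        if device.type == "cuda":
            torch.cuda.synchronize()         # wait for queued GPU work
        start = time.perf_counter()
        for _ in range(runs):
            model(clip)
        if device.type == "cuda":
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs * 1000.0


if __name__ == "__main__":
    # Placeholder 3D-conv model standing in for an action-recognition backbone.
    model = nn.Sequential(
        nn.Conv3d(3, 32, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool3d(1),
        nn.Flatten(),
        nn.Linear(32, 400),                  # e.g. 400 action classes
    )
    clip = torch.randn(1, 3, 16, 224, 224)   # 1 clip, 16 frames, 224x224
    dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"{measure_latency(model, clip, dev):.1f} ms per clip on {dev}")
```

Repeating such a measurement per device (and alongside accuracy, parameter count, and FLOPs) is the kind of multi-axis comparison the study performs across the benchmarked models.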