Efficient video action recognition remains a challenging problem. One large model after another claims the state of the art on the Kinetics dataset, yet real-world efficiency evaluations are often lacking. In this work, we fill this gap and investigate the use of transformers for efficient action recognition. We propose a novel, lightweight action recognition architecture, VideoLightFormer. In a factorized fashion, we carefully extend the 2D convolutional Temporal Segment Network with transformers, while maintaining spatial and temporal video structure throughout the entire model. Existing methods often resort to one of two extremes: they either apply huge transformers to video features, or minimal transformers to highly pooled video features. Our method differs by keeping the transformers small while leveraging the full spatiotemporal feature structure. We evaluate VideoLightFormer in a high-efficiency setting on the temporally demanding EPIC-KITCHENS-100 and Something-Something-V2 (SSV2) datasets and find that it achieves a better trade-off between efficiency and accuracy than existing state-of-the-art models, apart from the Temporal Shift Module on SSV2.
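The following is a minimal PyTorch sketch of the factorized idea described above, not the authors' exact architecture: per-frame features from a TSN-style 2D backbone are refined by a small spatial transformer within each frame and a small temporal transformer across frames, so spatial and temporal structure is kept explicit throughout. All module names, dimensions, and layer counts are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class VideoLightFormerSketch(nn.Module):
    """Illustrative sketch: 2D backbone + small factorized spatial/temporal transformers."""

    def __init__(self, dim=256, heads=4, num_classes=174, num_frames=8):
        super().__init__()
        # TSN-style 2D backbone applied independently to each sampled frame;
        # drop the average pool and classifier to keep the spatial feature map.
        backbone = models.resnet18(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Conv2d(512, dim, kernel_size=1)

        # Small spatial transformer: attends over the H'xW' tokens of one frame.
        spatial_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=2 * dim, batch_first=True)
        self.spatial_tf = nn.TransformerEncoder(spatial_layer, num_layers=1)

        # Small temporal transformer: attends over the T frames at each location.
        temporal_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=2 * dim, batch_first=True)
        self.temporal_tf = nn.TransformerEncoder(temporal_layer, num_layers=1)

        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                                   # x: (B, T, 3, H, W)
        b, t, c, h, w = x.shape
        feat = self.proj(self.backbone(x.flatten(0, 1)))    # (B*T, dim, H', W')
        _, d, hh, ww = feat.shape

        # Spatial attention within each frame.
        tokens = feat.flatten(2).transpose(1, 2)             # (B*T, H'*W', dim)
        tokens = self.spatial_tf(tokens)

        # Temporal attention across frames at each spatial location.
        tokens = tokens.reshape(b, t, hh * ww, d).transpose(1, 2)  # (B, H'*W', T, dim)
        tokens = self.temporal_tf(tokens.flatten(0, 1))            # (B*H'*W', T, dim)

        # Global average pool over space and time, then classify.
        pooled = tokens.reshape(b, hh * ww, t, d).mean(dim=(1, 2))  # (B, dim)
        return self.head(pooled)


if __name__ == "__main__":
    clip = torch.randn(2, 8, 3, 112, 112)        # 2 clips of 8 frames each
    print(VideoLightFormerSketch()(clip).shape)  # torch.Size([2, 174])
```

In this sketch the transformers are deliberately tiny (one layer each), reflecting the abstract's emphasis on keeping the transformer components small while operating on the full spatiotemporal feature grid rather than on heavily pooled features.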