Video transformers have achieved impressive results on major video recognition benchmarks; however, they suffer from high computational cost. In this paper, we present STTS, a token selection framework that dynamically selects a small number of informative tokens in both the temporal and spatial dimensions, conditioned on the input video. Specifically, we formulate token selection as a ranking problem: a lightweight selection network estimates the importance of each token, and only the tokens with the top scores are used for downstream computation. In the temporal dimension, we keep the frames that are most relevant for recognizing action categories, while in the spatial dimension, we identify the most discriminative regions in feature maps without disturbing the spatial context exploited hierarchically by most video transformers. Since the token selection decision is non-differentiable, we employ a perturbed-maximum-based differentiable Top-K operator to enable end-to-end training. We conduct extensive experiments on Kinetics-400 with MViT, a recently introduced video transformer backbone. Our framework achieves comparable accuracy while requiring 20% less computation. We also demonstrate that our approach is compatible with other transformer architectures.
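The perturbed-maximum idea behind the differentiable Top-K can be sketched as follows. This is a minimal NumPy illustration of the general technique, not the authors' implementation: the hard Top-K indicator is averaged over Gaussian perturbations of the scores, yielding a soft selection mask that is differentiable in expectation. The hyperparameters `sigma` and `num_samples` are illustrative assumptions.

```python
import numpy as np

def perturbed_topk(scores, k, num_samples=500, sigma=0.05, seed=0):
    """Smooth (expected) Top-K indicator via the perturbed-maximum trick.

    Returns a length-n vector of soft selection probabilities: each entry
    is the fraction of noise draws in which that token ranked in the top k.
    In expectation this relaxation is differentiable w.r.t. `scores`,
    which is what allows end-to-end training of a selection network.
    """
    rng = np.random.default_rng(seed)
    n = scores.shape[0]
    # Draw Gaussian perturbations and add them to the token scores.
    noise = rng.normal(0.0, sigma, size=(num_samples, n))
    perturbed = scores[None, :] + noise            # [num_samples, n]
    # Hard Top-K indices per noise sample (k largest perturbed scores).
    topk_idx = np.argpartition(-perturbed, k, axis=1)[:, :k]
    indicators = np.zeros_like(perturbed)
    np.put_along_axis(indicators, topk_idx, 1.0, axis=1)
    # Averaging the hard indicators gives the soft Top-K mask.
    return indicators.mean(axis=0)

# Hypothetical token-importance scores; the mask concentrates on the top 2.
scores = np.array([0.9, 0.1, 0.8, 0.2])
mask = perturbed_topk(scores, k=2)
```

In a full model, the mask would gate token features before the remaining transformer blocks; at inference time the relaxation is replaced by an ordinary hard Top-K.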