Most action recognition solutions rely on dense sampling to precisely cover the informative temporal clip. Extensively searching the temporal region is expensive for a real-world application. In this work, we focus on improving the inference efficiency of current action recognition backbones on trimmed videos, and illustrate that one action model can also cover the informative region by dropping non-informative features. We present Selective Feature Compression (SFC), an action recognition inference strategy that greatly increases model inference efficiency without any accuracy compromise. Different from previous works that compress kernel sizes and decrease the channel dimension, we propose to compress the feature flow at the spatio-temporal dimension without changing any backbone parameters. Our experiments on Kinetics-400, UCF101 and ActivityNet show that SFC is able to reduce inference time by 6-7x and memory usage by 5-6x compared with the commonly used 30-crop dense sampling procedure, while also slightly improving Top-1 accuracy. We thoroughly evaluate SFC and all its components, both quantitatively and qualitatively, and show how SFC learns to attend to important video regions and to drop temporal features that are uninformative for the task of action recognition.