Recent works have shown that the computational efficiency of video recognition can be significantly improved by reducing the spatial redundancy. As a representative work, the adaptive focus method (AdaFocus) has achieved a favorable trade-off between accuracy and inference speed by dynamically identifying and attending to the informative regions in each video frame. However, AdaFocus requires a complicated three-stage training pipeline (involving reinforcement learning), which leads to slow convergence and is unfriendly to practitioners. This work reformulates the training of AdaFocus as a simple one-stage algorithm by introducing a differentiable interpolation-based patch selection operation, enabling efficient end-to-end optimization. We further present an improved training scheme to address the issues introduced by the one-stage formulation, including the lack of supervision, input diversity, and training stability. Moreover, a conditional-exit technique is proposed to perform temporal adaptive computation on top of AdaFocus without additional training. Extensive experiments on six benchmark datasets (i.e., ActivityNet, FCVID, Mini-Kinetics, Something-Something V1&V2, and Jester) demonstrate that our model significantly outperforms the original AdaFocus and other competitive baselines, while being considerably simpler and more efficient to train. Code is available at https://github.com/LeapLabTHU/AdaFocusV2.
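To illustrate the differentiable interpolation-based patch selection mentioned above, here is a minimal PyTorch sketch: a P×P sampling grid is shifted to a predicted patch center and the patch is extracted with bilinear interpolation (torch.nn.functional.grid_sample), so gradients flow back to the center coordinates. The function name crop_patch_differentiable and its argument conventions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def crop_patch_differentiable(frames, centers, patch_size):
    """Differentiably crop a square patch from each frame.

    frames:     (B, C, H, W) input video frames
    centers:    (B, 2) patch centers, (x, y) in [-1, 1] normalized coords
    patch_size: side length P of the square patch in pixels
    """
    B, C, H, W = frames.shape
    # Half-extent of the patch in normalized [-1, 1] coordinates:
    # P pixels span 2*P/W of the width, so the half-extent is P/W.
    half_h = patch_size / H
    half_w = patch_size / W
    # A regular P x P grid centered at the origin.
    ys = torch.linspace(-half_h, half_h, patch_size, device=frames.device)
    xs = torch.linspace(-half_w, half_w, patch_size, device=frames.device)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    base = torch.stack([gx, gy], dim=-1)              # (P, P, 2), (x, y) order
    # Shift the grid to each predicted center; the grid remains
    # differentiable w.r.t. `centers`, enabling end-to-end training.
    grid = base.unsqueeze(0) + centers.view(B, 1, 1, 2)
    return F.grid_sample(frames, grid, mode="bilinear", align_corners=True)
```

In such a setup, a policy network could produce `centers` through a tanh head so they stay in [-1, 1]; the recognition loss then updates the patch-selection policy directly by backpropagation, which is what removes the need for reinforcement learning in the one-stage formulation.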
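The conditional-exit technique can likewise be sketched as confidence-thresholded early termination over frames. The exact exit criterion in the paper may differ; conditional_exit below is a hypothetical helper assuming a single video's per-frame class logits and a fixed softmax-confidence threshold.

```python
import torch

@torch.no_grad()
def conditional_exit(logits_per_frame, threshold=0.95):
    """Stop processing frames once the running prediction is confident.

    logits_per_frame: iterable of (num_classes,) tensors, one per frame
    threshold:        top-class confidence required to exit early
    Returns the predicted class index and the number of frames used.
    """
    running = None
    for t, logits in enumerate(logits_per_frame, start=1):
        probs = torch.softmax(logits, dim=-1)
        # Accumulate per-frame probabilities and average them.
        running = probs if running is None else running + probs
        confidence, prediction = (running / t).max(dim=-1)
        if confidence.item() >= threshold:
            return prediction.item(), t      # exit early: confident enough
    return prediction.item(), t              # used all frames
```

Because the exit decision only reads the model's existing per-frame outputs, a rule of this kind can be applied at inference time without any additional training, consistent with the abstract's claim.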