In this paper, we explore the spatial redundancy in video recognition with the aim of improving computational efficiency. We observe that the most informative region in each frame of a video is usually a small image patch, which shifts smoothly across frames. Therefore, we model the patch localization problem as a sequential decision task, and propose a reinforcement-learning-based approach for efficient spatially adaptive video recognition (AdaFocus). Specifically, a lightweight ConvNet is first adopted to quickly process the full video sequence, whose features are used by a recurrent policy network to localize the most task-relevant regions. Then the selected patches are processed by a high-capacity network to obtain the final prediction. During offline inference, once the informative patch sequence has been generated, the bulk of the computation can be performed in parallel, which is efficient on modern GPU devices. In addition, we demonstrate that the proposed method can be easily extended to exploit temporal redundancy as well, e.g., by dynamically skipping less valuable frames. Extensive experiments on five benchmark datasets, i.e., ActivityNet, FCVID, Mini-Kinetics, and Something-Something V1&V2, demonstrate that our method is significantly more efficient than the competitive baselines. Code will be available at https://github.com/blackfeather-wang/AdaFocus.
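The glance-then-focus pipeline described above can be summarized in a minimal PyTorch sketch. Everything here (the class name `AdaFocusSketch`, the module sizes, the hard-crop helper `crop_patch`, and the averaging of per-frame predictions) is an illustrative assumption rather than the released implementation; in particular, the real policy network is trained with reinforcement learning, while this sketch only shows the forward pass.

```python
# Minimal sketch of an AdaFocus-style forward pass (illustrative, not the
# authors' code): a cheap global network scans full frames, a recurrent
# policy picks a patch location per frame, and a high-capacity network
# processes only the selected patches.
import torch
import torch.nn as nn


def crop_patch(frame, centre, size):
    # frame: (3, H, W); centre: (2,) tensor with values in [0, 1].
    # Hard crop of a size x size patch, clamped to stay inside the frame.
    _, H, W = frame.shape
    cy = max(0, min(int(centre[0].item() * (H - size)), H - size))
    cx = max(0, min(int(centre[1].item() * (W - size)), W - size))
    return frame[:, cy:cy + size, cx:cx + size]


class AdaFocusSketch(nn.Module):
    def __init__(self, patch_size=96, feat_dim=64, hidden=128, num_classes=10):
        super().__init__()
        self.patch_size = patch_size
        # Lightweight global network: cheaply scans each full frame.
        self.global_cnn = nn.Sequential(
            nn.Conv2d(3, feat_dim, 7, stride=4, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # Recurrent policy network: emits a patch location per frame,
        # conditioned on what has been seen so far.
        self.policy_rnn = nn.GRUCell(feat_dim, hidden)
        self.loc_head = nn.Linear(hidden, 2)
        # High-capacity local network: runs only on the small patches.
        self.local_cnn = nn.Sequential(
            nn.Conv2d(3, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, video):
        # video: (B, T, 3, H, W)
        B, T = video.shape[:2]
        h = video.new_zeros(B, self.policy_rnn.hidden_size)
        logits = []
        for t in range(T):
            frame = video[:, t]
            glance = self.global_cnn(frame)            # cheap global features
            h = self.policy_rnn(glance, h)             # update recurrent state
            centre = torch.sigmoid(self.loc_head(h))   # patch location in [0, 1]
            patches = torch.stack([
                crop_patch(frame[b], centre[b], self.patch_size)
                for b in range(B)])
            # At offline inference, all locations can be computed first from
            # the cheap global pass alone, and the patches then fed to
            # local_cnn as one parallel batch; the loop keeps the sketch simple.
            logits.append(self.classifier(self.local_cnn(patches)))
        # Average per-frame predictions for the video-level output.
        return torch.stack(logits, dim=1).mean(dim=1)


model = AdaFocusSketch()
clip = torch.randn(2, 8, 3, 224, 224)  # 2 clips, 8 frames of 224x224 each
print(model(clip).shape)               # torch.Size([2, 10])
```

Note that in this sketch the patch locations depend only on the cheap global features and the recurrent state, which is what makes the parallel offline-inference claim in the abstract possible: the sequential part of the computation is the light one.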