Recent research has revealed that reducing temporal and spatial redundancy are both effective approaches to efficient video recognition, e.g., allocating the majority of computation to a task-relevant subset of frames or to the most valuable image regions of each frame. However, most existing works model either type of redundancy in isolation, with the other absent. This paper explores a unified formulation of spatio-temporal dynamic computation on top of the recently proposed AdaFocusV2 algorithm, contributing to an improved AdaFocusV3 framework. Our method reduces computational cost by activating the expensive high-capacity network only on a few small but informative 3D video cubes. These cubes are cropped from the space formed by frame height, frame width, and video duration, and their locations are adaptively determined by a lightweight policy network on a per-sample basis. At test time, the number of cubes processed for each video is dynamically configured, i.e., video cubes are processed sequentially until a sufficiently reliable prediction is produced. Notably, AdaFocusV3 can be trained effectively by approximating the non-differentiable cropping operation with the interpolation of deep features. Extensive empirical results on six benchmark datasets (i.e., ActivityNet, FCVID, Mini-Kinetics, Something-Something V1&V2, and Diving48) demonstrate that our model is considerably more efficient than competitive baselines.
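To make the two core mechanisms concrete, below is a minimal PyTorch sketch, not the authors' released code, of (a) differentiable 3D cube cropping via trilinear interpolation of deep features, which is what makes end-to-end training possible despite the non-differentiable crop, and (b) the sequential early-exit inference loop. The cube size, the `policy_net`/`classifier` interfaces, and the confidence threshold are all assumptions introduced for illustration.

```python
# A hedged sketch of AdaFocusV3-style spatio-temporal cube cropping.
# Interfaces (policy_net, classifier, cube_size) are hypothetical.
import torch
import torch.nn.functional as F

def crop_cube(features, centers, cube_size):
    """Differentiably crop a 3D cube from a video feature volume.

    features:  (N, C, T, H, W) deep feature maps of the full video.
    centers:   (N, 3) cube centers in [-1, 1], ordered (t, y, x),
               e.g. produced by a lightweight policy network.
    cube_size: (t, h, w) extent of the cube in feature cells.
    """
    n = features.size(0)
    t, h, w = cube_size
    # Half-extent of the cube per axis in normalized [-1, 1] coordinates.
    ht = t / features.size(2)
    hh = h / features.size(3)
    hw = w / features.size(4)
    dev = features.device
    dt = torch.linspace(-ht, ht, t, device=dev)
    dy = torch.linspace(-hh, hh, h, device=dev)
    dx = torch.linspace(-hw, hw, w, device=dev)
    zz, yy, xx = torch.meshgrid(dt, dy, dx, indexing="ij")
    # grid_sample expects the grid's last dim ordered (x, y, z) in [-1, 1].
    offsets = torch.stack([xx, yy, zz], dim=-1)          # (t, h, w, 3)
    cxyz = centers[:, [2, 1, 0]].view(n, 1, 1, 1, 3)     # reorder (t,y,x)->(x,y,z)
    grid = (cxyz + offsets.unsqueeze(0)).clamp(-1, 1)    # (N, t, h, w, 3)
    # Trilinear interpolation keeps the crop differentiable w.r.t. centers,
    # approximating the hard (non-differentiable) cropping operation.
    return F.grid_sample(features, grid, align_corners=True)

def adaptive_inference(features, policy_net, classifier, cube_size,
                       max_cubes=4, threshold=0.9):
    """Process cubes sequentially; stop once the prediction is confident.

    policy_net maps the feature volume to a cube center in [-1, 1];
    classifier maps the accumulated cube features to class logits.
    Both are assumed interfaces for this sketch.
    """
    cubes = []
    for _ in range(max_cubes):
        centers = torch.tanh(policy_net(features))       # (N, 3) in [-1, 1]
        cubes.append(crop_cube(features, centers, cube_size))
        logits = classifier(torch.stack(cubes, dim=1))
        # Early exit: once the max softmax confidence clears the
        # threshold, skip the remaining (expensive) cube evaluations.
        if logits.softmax(dim=-1).max().item() >= threshold:
            break
    return logits
```

The early-exit loop is what realizes the dynamically configured number of cubes per video: easy samples terminate after one cube, while harder ones consume more computation.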