In this paper, we present a new approach to model acceleration that exploits the spatial sparsity of visual data. We observe that the final prediction in vision Transformers is based on only a subset of the most informative tokens, which is sufficient for accurate image recognition. Based on this observation, we propose a dynamic token sparsification framework that prunes redundant tokens progressively and dynamically based on the input, thereby accelerating vision Transformers. Specifically, we devise a lightweight prediction module that estimates the importance score of each token given the current features. The module is added to different layers to prune redundant tokens hierarchically. Although the framework is inspired by our observation of sparse attention in vision Transformers, we find that the idea of adaptive and asymmetric computation can serve as a general solution for accelerating various architectures. We extend our method to hierarchical models, including CNNs and hierarchical vision Transformers, as well as to more complex dense prediction tasks that require structured feature maps, by formulating a more generic dynamic spatial sparsification framework with progressive sparsification and asymmetric computation across spatial locations. By routing less informative features through lightweight fast paths and more important locations through more expressive slow paths, we maintain the structure of the feature maps while significantly reducing the overall computation. Extensive experiments demonstrate the effectiveness of our framework on various modern architectures and visual recognition tasks. Our results clearly show that dynamic spatial sparsification offers a new and more effective dimension for model acceleration. Code is available at https://github.com/raoyongming/DynamicViT.
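To make the token sparsification idea concrete, the sketch below shows a minimal PyTorch version of a lightweight scoring module followed by hard top-k pruning. It is an illustration under stated assumptions, not the authors' implementation: the class and helper names (TokenScorePredictor, prune_tokens), the hidden width, the use of a mean-pooled global context, and the fixed keep_ratio are all hypothetical choices, and the actual method additionally makes the token selection differentiable during training (e.g., via stochastic sampling), which is omitted here.

```python
import torch
import torch.nn as nn


class TokenScorePredictor(nn.Module):
    """Lightweight module that scores token importance from current features.

    Minimal sketch: concatenates each token with a mean-pooled global context
    vector and maps the pair to a scalar score with a small MLP.
    """

    def __init__(self, dim, hidden_dim=64):
        super().__init__()
        self.score_net = nn.Sequential(
            nn.LayerNorm(dim * 2),
            nn.Linear(dim * 2, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, tokens):
        # tokens: (batch, num_tokens, dim)
        global_ctx = tokens.mean(dim=1, keepdim=True).expand_as(tokens)
        scores = self.score_net(torch.cat([tokens, global_ctx], dim=-1))
        return scores.squeeze(-1)  # (batch, num_tokens)


def prune_tokens(tokens, scores, keep_ratio=0.7):
    """Hypothetical helper: keep only the highest-scoring fraction of tokens."""
    batch, num_tokens, dim = tokens.shape
    num_keep = max(1, int(num_tokens * keep_ratio))
    keep_idx = scores.topk(num_keep, dim=1).indices          # (batch, num_keep)
    keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, dim)    # (batch, num_keep, dim)
    return tokens.gather(1, keep_idx)


if __name__ == "__main__":
    x = torch.randn(2, 196, 384)               # e.g., 14x14 patch tokens
    predictor = TokenScorePredictor(dim=384)
    kept = prune_tokens(x, predictor(x), keep_ratio=0.7)
    print(kept.shape)                           # torch.Size([2, 137, 384])
```

Applying such a module at several layers with a shrinking keep ratio yields the progressive, input-dependent sparsification described above; for dense prediction, pruned locations would instead be routed through a cheaper fast path so the spatial structure of the feature map is preserved.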