DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification

Attention is sparse in vision transformers. We observe that the final prediction of a vision transformer is based on only a subset of the most informative tokens, and that this subset is sufficient for accurate image recognition. Based on this observation, we propose a dynamic token sparsification framework that prunes redundant tokens progressively and dynamically, conditioned on the input. Specifically, we devise a lightweight prediction module that estimates the importance score of each token given the current features. The module is added at different layers to prune redundant tokens hierarchically. To optimize the prediction module end to end, we propose an attention masking strategy that prunes a token differentiably by blocking its interactions with other tokens. Owing to the nature of self-attention, the unstructured sparse tokens remain hardware friendly, so the framework readily achieves actual speed-up. By hierarchically pruning 66% of the input tokens, our method reduces FLOPs by 31% to 37% and improves throughput by over 40%, while the drop in accuracy stays within 0.5% for various vision transformers. Equipped with the dynamic token sparsification framework, DynamicViT models achieve very competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs and vision transformers on ImageNet. Code is available at https://github.com/raoyongming/DynamicViT
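
The attention masking idea in the abstract (pruning a token by blocking its interactions with other tokens rather than deleting it) can be sketched roughly as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation, which lives in the linked repository. The function name `masked_attention`, the 0/1 `keep` vector, and the choice to let every token still attend to itself are all hypothetical choices made for this example.

```python
import math


def masked_attention(scores, keep):
    """Row-wise softmax in which dropped tokens are hidden from other tokens.

    scores: N x N nested list of attention logits (row = query, column = key).
    keep:   length-N list of 0/1 decisions, 1 meaning the token is kept.
    Assumption: each token may still attend to itself, so no row sums to zero.
    """
    n = len(scores)
    attn = []
    for i in range(n):
        row_max = max(scores[i])  # subtract the row max for numerical stability
        weights = [
            math.exp(scores[i][j] - row_max) if (i == j or keep[j]) else 0.0
            for j in range(n)
        ]
        total = sum(weights)
        attn.append([w / total for w in weights])
    return attn


# Toy example: three tokens, with token 1 marked as pruned.
scores = [[0.1, 2.0, -1.0],
          [0.5, 0.3, 1.2],
          [1.0, -0.4, 0.7]]
masked = masked_attention(scores, keep=[1, 0, 1])

# Reference: physically remove token 1 and run ordinary softmax attention.
kept = [0, 2]
removed = masked_attention([[scores[i][j] for j in kept] for i in kept], keep=[1, 1])
```

In this sketch, the attention weights among the kept tokens in `masked` agree with those in `removed` up to floating-point error. This illustrates how masking can stand in for actually discarding tokens while the sequence keeps a fixed length.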
