Vision transformers have achieved significant improvements on various vision tasks, but their quadratic interactions between tokens substantially reduce computational efficiency. Recently, many pruning methods have been proposed to remove redundant tokens and obtain efficient vision transformers. However, existing studies mainly focus on token importance, preserving locally attentive tokens while completely ignoring global token diversity. In this paper, we emphasize the importance of diverse global semantics and propose an efficient token decoupling and merging method that jointly considers token importance and diversity for token pruning. Based on the class token attention, we decouple the tokens into attentive and inattentive ones. In addition to preserving the most discriminative local tokens, we merge similar inattentive tokens and match homogeneous attentive tokens to maximize token diversity. Despite its simplicity, our method achieves a promising trade-off between model complexity and classification accuracy. On DeiT-S, our method reduces the FLOPs by 35% with only a 0.2% accuracy drop. Notably, by maintaining token diversity, our method can even improve the accuracy of DeiT-T by 0.1% while reducing its FLOPs by 40%.
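As a rough illustration of the attention-based decoupling step described above, the sketch below keeps the patch tokens most attended by the class token and fuses the remaining inattentive ones into a single token weighted by their class attention. This is a minimal, hypothetical implementation, not the paper's exact method: the function name `prune_tokens`, the `keep_ratio` parameter, and the single-token fusion are assumptions for illustration, and the paper's additional steps (merging similar inattentive tokens and matching homogeneous attentive tokens to preserve diversity) are not shown here.

```python
import torch

def prune_tokens(tokens, cls_attn, keep_ratio=0.65):
    """Sketch of class-attention-based token decoupling (hypothetical helper).

    tokens:   (B, N, C) patch tokens, class token excluded
    cls_attn: (B, N)    attention from the class token to each patch token
    """
    B, N, C = tokens.shape
    num_keep = max(1, int(N * keep_ratio))

    # Rank tokens by class-token attention; split into attentive / inattentive.
    idx = cls_attn.argsort(dim=1, descending=True)
    keep_idx, drop_idx = idx[:, :num_keep], idx[:, num_keep:]

    batch = torch.arange(B, device=tokens.device).unsqueeze(1)
    kept = tokens[batch, keep_idx]       # (B, num_keep, C) attentive tokens
    dropped = tokens[batch, drop_idx]    # (B, N - num_keep, C) inattentive tokens
    drop_w = cls_attn[batch, drop_idx]   # (B, N - num_keep) their attention weights

    # Fuse inattentive tokens into one token, weighted by class attention,
    # so their information is merged rather than discarded outright.
    drop_w = drop_w / drop_w.sum(dim=1, keepdim=True).clamp(min=1e-6)
    fused = (dropped * drop_w.unsqueeze(-1)).sum(dim=1, keepdim=True)

    return torch.cat([kept, fused], dim=1)  # (B, num_keep + 1, C)
```

With `keep_ratio=0.65`, roughly a third of the patch tokens are removed per pruning stage, which is in line with the reported FLOPs reductions, though the exact pruning schedule is defined by the method itself.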