The computational complexity of Vision Transformers (ViTs) is quadratic in the number of tokens, which limits their practical applications. Several works propose pruning redundant tokens to obtain efficient ViTs. However, these methods generally suffer from (i) dramatic accuracy drops, (ii) difficulty of application to local vision transformers, and (iii) networks that are not general-purpose for downstream tasks. In this work, we propose a novel Semantic Token ViT (STViT) for efficient global and local vision transformers, which can also be revised to serve as a backbone for downstream tasks. The semantic tokens represent cluster centers: they are initialized by spatially pooling image tokens and recovered by attention, so they can adaptively represent global or local semantic information. Owing to these cluster properties, a few semantic tokens can attain the same effect as a large number of image tokens, for both global and local vision transformers. For instance, only 16 semantic tokens on DeiT-(Tiny, Small, Base) achieve the same accuracy with more than 100% inference speed improvement and nearly 60% FLOPs reduction; on Swin-(Tiny, Small, Base), we can employ 16 semantic tokens in each window to further speed it up by around 20% with a slight accuracy increase. Beyond image classification, we also extend our method to video recognition. In addition, we design an STViT-R(ecover) network that restores detailed spatial information on top of STViT, making it work for downstream tasks, which previous token sparsification methods cannot handle. Experiments demonstrate that our method achieves competitive results compared to the original networks in object detection and instance segmentation, with over 30% FLOPs reduction for the backbone.
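To make the idea of semantic tokens as cluster centers concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes image tokens laid out row-major on a square grid whose sides are divisible by the pooled grid, uses plain average pooling for initialization, and uses a single-head attention step without learned projections for recovery. The function names `init_semantic_tokens` and `recover_by_attention` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def init_semantic_tokens(image_tokens, grid_hw, pooled_hw):
    """Average-pool an (H*W, C) token grid down to (h*w, C) initial centers.

    Assumption: H is divisible by h and W by w (non-overlapping windows).
    """
    H, W = grid_hw
    h, w = pooled_hw
    C = image_tokens.shape[-1]
    assert image_tokens.shape[0] == H * W
    assert H % h == 0 and W % w == 0
    # Split each spatial axis into (number of windows, window size).
    grid = image_tokens.reshape(h, H // h, w, W // w, C)
    return grid.mean(axis=(1, 3)).reshape(h * w, C)

def recover_by_attention(semantic_tokens, image_tokens):
    """One attention step: semantic tokens are queries, image tokens are
    keys and values. Learned projections are omitted (an assumption)."""
    C = image_tokens.shape[-1]
    scores = semantic_tokens @ image_tokens.T / np.sqrt(C)  # (M, N)
    weights = softmax(scores, axis=-1)                      # rows sum to 1
    return weights @ image_tokens                           # (M, C)

# Hypothetical sizes: a 16x16 token grid reduced to 4x4 = 16 semantic tokens.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((16 * 16, 32))
centers = init_semantic_tokens(tokens, (16, 16), (4, 4))
semantic = recover_by_attention(centers, tokens)
```

In this sketch, later layers that operate only on the 16 semantic tokens would attend over 16 tokens instead of 256, which is the kind of saving the abstract attributes to replacing image tokens with a small set of cluster centers.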