Vision transformers have emerged as a new paradigm in computer vision, delivering excellent performance but at an expensive computational cost. Image token pruning is one of the main approaches to ViT compression, because the complexity is quadratic in the number of tokens and many tokens covering only background regions do not truly contribute to the final prediction. Existing works either rely on additional modules to score the importance of individual tokens or apply a fixed pruning ratio to all input instances. In this work, we propose an adaptive sparse token pruning framework with minimal cost. Our approach is based on learnable thresholds and leverages Multi-Head Self-Attention to evaluate token informativeness with few additional operations. Specifically, we first propose an inexpensive attention-head-importance-weighted class attention scoring mechanism. Then, learnable parameters are inserted into the ViT as thresholds to distinguish informative tokens from unimportant ones. By comparing token attention scores against these thresholds, we discard useless tokens hierarchically and thus accelerate inference. The learnable thresholds are optimized with budget-aware training to balance accuracy and complexity, yielding instance-specific pruning configurations for different inputs. Extensive experiments demonstrate the effectiveness of our approach. For example, our method improves the throughput of DeiT-S by 50% with only a 0.2% drop in top-1 accuracy, achieving a better trade-off between accuracy and latency than previous methods.
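As a rough illustration of the mechanism described above, the PyTorch-style sketch below shows how a head-importance-weighted class-attention score could be compared against a learnable threshold to select tokens per instance. It is a minimal sketch under stated assumptions, not the actual implementation; names such as `ThresholdTokenPruner`, `head_importance`, and `init_threshold` are hypothetical.

```python
import torch
import torch.nn as nn


class ThresholdTokenPruner(nn.Module):
    """Illustrative sketch: score image tokens by class attention weighted
    by head importance, then keep tokens whose score exceeds a learnable
    threshold. All names and details here are assumptions for exposition."""

    def __init__(self, num_heads: int, init_threshold: float = 0.0):
        super().__init__()
        # Learnable per-layer threshold separating informative tokens from
        # unimportant ones (optimized during budget-aware training).
        self.threshold = nn.Parameter(torch.tensor(init_threshold))
        # Assumed learnable head-importance weights, normalized via softmax.
        self.head_importance = nn.Parameter(torch.ones(num_heads) / num_heads)

    def forward(self, attn: torch.Tensor, tokens: torch.Tensor):
        # attn:   (B, H, N, N) attention probabilities from Multi-Head Self-Attention
        # tokens: (B, N, C) token embeddings, where index 0 is the class token
        cls_attn = attn[:, :, 0, 1:]                       # (B, H, N-1) class attention to image tokens
        w = torch.softmax(self.head_importance, dim=0)     # (H,) head-importance weights
        scores = (cls_attn * w.view(1, -1, 1)).sum(dim=1)  # (B, N-1) weighted class-attention score
        keep = scores >= self.threshold                    # per-instance boolean keep mask
        # For illustration we zero out pruned tokens; a real implementation
        # would gather the kept tokens to actually reduce computation.
        mask = torch.cat([torch.ones_like(keep[:, :1]), keep], dim=1)
        return tokens * mask.unsqueeze(-1).to(tokens.dtype), keep


if __name__ == "__main__":
    B, H, N, C = 2, 6, 197, 384                            # DeiT-S-like shapes (assumed)
    attn = torch.softmax(torch.randn(B, H, N, N), dim=-1)
    tokens = torch.randn(B, N, C)
    pruner = ThresholdTokenPruner(num_heads=H, init_threshold=1.0 / (N - 1))
    out, keep = pruner(attn, tokens)
    print(keep.float().mean().item())                      # fraction of image tokens kept
```

Because the threshold is a learnable parameter, different input instances naturally keep different numbers of tokens, which is the source of the adaptive, instance-specific pruning configurations mentioned above.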