Vision transformers (ViTs) have demonstrated great potential in various visual tasks, but suffer from expensive computational and memory costs when deployed on resource-constrained devices. In this paper, we introduce a ternary vision transformer (TerViT) to ternarize the weights in ViTs, which is challenged by the large loss-surface gap between real-valued and ternary parameters. To address this issue, we introduce a progressive training scheme that first trains an 8-bit transformer and then TerViT, achieving better optimization than conventional methods. Furthermore, we introduce channel-wise ternarization, which partitions each weight matrix into channels, each with its own distribution and ternarization interval. We apply our methods to the popular DeiT and Swin backbones, and extensive results show that we achieve competitive performance. For example, TerViT quantizes Swin-S to a 13.1 MB model size while achieving above 79% Top-1 accuracy on the ImageNet dataset.
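To make the channel-wise idea concrete, below is a minimal sketch of channel-wise weight ternarization in PyTorch, assuming a weight matrix of shape (out_channels, in_features) and the common Ternary Weight Networks heuristic of a 0.7 threshold factor; the function name and the exact threshold rule are illustrative assumptions, not the paper's verbatim procedure.

```python
import torch

def ternarize_channel_wise(weight: torch.Tensor) -> torch.Tensor:
    """Quantize each output channel (row) to {-alpha_c, 0, +alpha_c}."""
    w_abs = weight.abs()
    # Per-channel threshold: each row gets its own ternarization interval
    # (0.7 * mean|W| is the TWN heuristic, assumed here for illustration).
    delta = 0.7 * w_abs.mean(dim=1, keepdim=True)        # (out_channels, 1)
    mask = (w_abs > delta).float()                        # nonzero positions
    # Per-channel scaling factor: mean magnitude of the surviving weights.
    alpha = (w_abs * mask).sum(dim=1, keepdim=True) / mask.sum(dim=1, keepdim=True).clamp(min=1.0)
    return alpha * torch.sign(weight) * mask

if __name__ == "__main__":
    w = torch.randn(8, 16)            # toy weight matrix
    w_t = ternarize_channel_wise(w)
    print(w_t.unique())               # a few per-channel +/-alpha values and 0
```

Because each row computes its own delta and alpha, channels with different weight distributions are ternarized on their own intervals rather than sharing a single layer-wide threshold.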