Transformer is a transformative framework for modeling sequential data and has achieved remarkable performance on a wide range of tasks, but at high computational and energy cost. To improve its efficiency, a popular choice is to compress the model via binarization, which constrains the floating-point values to binary ones so that the resulting cheap bitwise operations significantly reduce resource consumption. However, existing binarization methods only aim at statistically minimizing the information loss for the input distribution, while ignoring the pairwise similarity modeling at the core of attention. To this end, we propose a new binarization paradigm customized to high-dimensional softmax attention via kernelized hashing, called EcoFormer, which maps the original queries and keys into low-dimensional binary codes in Hamming space. The kernelized hash functions are learned to match the ground-truth similarity relations extracted from the attention map in a self-supervised way. Based on the equivalence between the inner product of binary codes and the Hamming distance, as well as the associative property of matrix multiplication, we can approximate the attention in linear complexity by expressing it as a dot product of binary codes. Moreover, the compact binary representations of queries and keys enable us to replace most of the expensive multiply-accumulate operations in attention with simple accumulations, saving a considerable on-chip energy footprint on edge devices. Extensive experiments on both vision and language tasks show that EcoFormer consistently achieves performance comparable to standard attention while consuming far fewer resources. For example, based on PVTv2-B0 and ImageNet-1K, EcoFormer achieves a 73% reduction in on-chip energy footprint with only a 0.33% performance drop compared to the standard attention. Code is available at https://github.com/ziplab/EcoFormer.
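To make the idea concrete, below is a minimal NumPy sketch of attention computed over binary codes, not the authors' implementation. It assumes queries and keys have already been hashed to {-1, +1} codes (the hypothetical `toy_hash` is a fixed sign projection standing in for the learned kernelized hash functions), and uses the identity that for b-bit codes the inner product equals b - 2 * d_H, so the non-negative Hamming affinity b - d_H can serve as the attention score; associativity then lets the key-value product be formed first for linear complexity.

```python
import numpy as np

def hamming_affinity_attention(Hq, Hk, V):
    """Sketch of linear attention over binary codes.

    Hq, Hk: (n, b) codes with entries in {-1, +1}; V: (n, d_v) floats.
    For such codes, <h_q, h_k> = b - 2 * d_H(h_q, h_k), so the non-negative
    affinity (b - d_H) equals (b + <h_q, h_k>) / 2. By associativity,
    Hk.T @ V is computed first, giving O(n * b * d_v) rather than O(n^2) cost.
    """
    n, b = Hq.shape
    v_sum = V.sum(axis=0, keepdims=True)           # (1, d_v)
    k_sum = Hk.sum(axis=0, keepdims=True)          # (1, b)

    numer = 0.5 * (b * v_sum + Hq @ (Hk.T @ V))    # (n, d_v) weighted values
    denom = 0.5 * (b * n + Hq @ k_sum.T)           # (n, 1) normalizer
    return numer / np.maximum(denom, 1e-6)

def toy_hash(X, W):
    """Hypothetical stand-in hash: a fixed projection followed by sign,
    yielding {-1, +1} codes. The paper instead learns kernelized hash
    functions in a self-supervised way from the attention map."""
    return np.where(X @ W >= 0, 1.0, -1.0)

# Toy usage: n tokens, d-dim queries/keys hashed to b-bit codes.
n, d, b, d_v = 128, 64, 16, 64
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=(n, d_v))
W = rng.normal(size=(d, b))
out = hamming_affinity_attention(toy_hash(Q, W), toy_hash(K, W), V)  # (n, d_v)
```

Note that the multiplications by {-1, +1} codes in the sketch are where, on suitable hardware, multiply-accumulate operations can be replaced by simple additions and subtractions.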