Recently, vision transformers have become very popular. However, deploying them in many applications is computationally expensive, partly due to the Softmax layer in the attention block. We introduce a simple but effective Softmax-free attention block, SimA, which normalizes the query and key matrices with a simple $\ell_1$-norm instead of using a Softmax layer. The attention block in SimA then becomes a simple multiplication of three matrices, so SimA can dynamically change the ordering of the computation at test time to achieve computation linear in the number of tokens or the number of channels. We empirically show that SimA, applied to three SOTA variations of transformers, DeiT, XCiT, and CvT, achieves accuracy on par with the SOTA models without any need for a Softmax layer. Interestingly, changing SimA from multi-head to single-head has only a small effect on accuracy, which simplifies the attention block further. The code is available here: $\href{https://github.com/UCDvision/sima}{\text{This https URL}}$
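To make the idea concrete, the following is a minimal single-head PyTorch sketch of Softmax-free attention as described above: query and key are $\ell_1$-normalized and the multiplication order is chosen based on the token and channel counts. The normalization axis, the epsilon, and the unscaled single-head form are assumptions of this sketch; see the linked repository for the authors' implementation.

```python
import torch

def sima_attention(q, k, v):
    """Softmax-free attention sketch.

    q, k, v: tensors of shape (batch, tokens, channels).
    Assumes l1 normalization of q and k along the token dimension.
    """
    n, d = q.shape[-2], q.shape[-1]
    # Replace Softmax: l1-normalize each channel of q and k across tokens.
    q = q / (q.abs().sum(dim=-2, keepdim=True) + 1e-6)
    k = k / (k.abs().sum(dim=-2, keepdim=True) + 1e-6)
    if n > d:
        # O(N * D^2): compute (k^T v) first, then multiply by q.
        return q @ (k.transpose(-2, -1) @ v)
    # O(N^2 * D): standard ordering, (q k^T) first, then multiply by v.
    return (q @ k.transpose(-2, -1)) @ v

# Usage example with random tensors (196 tokens, 64 channels).
x = torch.randn(2, 196, 64)
out = sima_attention(x, x, x)
print(out.shape)  # torch.Size([2, 196, 64])
```

Because the attention output is a plain product of three matrices, both orderings give the same result, and the cheaper one can be picked at test time depending on whether tokens or channels dominate.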