While transformers have begun to dominate many tasks in vision, applying them to large images is still computationally difficult. A large reason for this is that self-attention scales quadratically with the number of tokens, which in turn scales quadratically with the image size. On larger images (e.g., 1080p), over 60% of the total computation in the network is spent solely on creating and applying attention matrices. We take a step toward solving this issue by introducing Hydra Attention, an extremely efficient attention operation for Vision Transformers (ViTs). Paradoxically, this efficiency comes from taking multi-head attention to its extreme: by using as many attention heads as there are features, Hydra Attention is computationally linear in both tokens and features with no hidden constants, making it significantly faster than standard self-attention in an off-the-shelf ViT-B/16 by a factor of the token count. Moreover, Hydra Attention retains high accuracy on ImageNet and, in some cases, actually improves it.
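To make the linearity claim concrete, below is a minimal PyTorch sketch of the core idea, assuming inputs of shape (batch, tokens, features) and an L2-normalization (cosine-similarity) kernel in place of softmax, in line with the decomposable kernel the paper uses; the function name and tensor layout are illustrative, not the authors' reference implementation. With one head per feature, the key-value interaction collapses into a single global feature vector, so the token-by-token attention matrix is never materialized and the cost is O(tokens × features):

```python
import torch
import torch.nn.functional as F

def hydra_attention(q, k, v):
    """Sketch of Hydra Attention with as many heads as features.

    q, k, v: tensors of shape (batch, tokens, features).
    Uses a cosine-similarity kernel (L2-normalized q and k), so the
    N x N attention matrix is never formed; cost is linear in both
    the token count and the feature dimension.
    """
    q = F.normalize(q, dim=-1)             # kernel feature map for queries
    k = F.normalize(k, dim=-1)             # kernel feature map for keys
    kv = (k * v).sum(dim=1, keepdim=True)  # global aggregate, shape (B, 1, D)
    return q * kv                          # broadcast back to every token
```

The contrast with standard self-attention is that softmax(QKᵀ) is an N × N matrix costing O(N²D) to build and apply; here the sum over tokens is taken before any interaction with the queries, which is what removes the quadratic term.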