Self-attention has become a de facto choice for capturing global context in various vision applications. However, its quadratic computational complexity with respect to image resolution limits its use in real-time applications, especially for deployment on resource-constrained mobile devices. Although hybrid approaches have been proposed to combine the advantages of convolutions and self-attention for a better speed-accuracy trade-off, the expensive matrix multiplication operations in self-attention remain a bottleneck. In this work, we introduce a novel efficient additive attention mechanism that effectively replaces the quadratic matrix multiplication operations with linear element-wise multiplications. Our design shows that the key-value interaction can be replaced with a linear layer without sacrificing any accuracy. Unlike previous state-of-the-art methods, our efficient formulation of self-attention enables its use at all stages of the network. Using the proposed efficient additive attention, we build a series of models called "SwiftFormer" that achieve state-of-the-art performance in terms of both accuracy and mobile inference speed. Our small variant achieves 78.5% top-1 ImageNet-1K accuracy with only 0.8 ms latency on an iPhone 14, which is more accurate and 2x faster than MobileViT-v2. Code: https://github.com/Amshaker/SwiftFormer
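To make the abstract's description concrete, below is a minimal PyTorch sketch of an additive-attention-style token mixer that avoids the quadratic attention matrix: per-token scores come from a single learnable vector, a global query is formed by weighted pooling, and the query-key interaction is element-wise followed by a linear layer. Module and parameter names (e.g., EfficientAdditiveAttention, w_a) are illustrative assumptions, not necessarily the authors' exact implementation; see the repository for the reference code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EfficientAdditiveAttention(nn.Module):
    """Sketch of an additive-attention token mixer with linear complexity.

    Query-key mixing is done with element-wise (broadcast) multiplication and
    linear layers, so cost scales with (tokens x dim) rather than tokens^2.
    Names and details are illustrative, not the official implementation.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.to_query = nn.Linear(dim, dim)
        self.to_key = nn.Linear(dim, dim)
        # Learnable vector that scores each query token (additive attention).
        self.w_a = nn.Parameter(torch.randn(dim, 1))
        self.scale = dim ** -0.5
        # Linear layer standing in for the explicit key-value interaction.
        self.proj = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        q = F.normalize(self.to_query(x), dim=-1)
        k = F.normalize(self.to_key(x), dim=-1)

        # Per-token attention scores: O(tokens * dim), no tokens x tokens matrix.
        scores = (q @ self.w_a) * self.scale            # (batch, tokens, 1)
        alpha = scores.softmax(dim=1)

        # Global query as an attention-weighted sum of the query tokens.
        q_global = (alpha * q).sum(dim=1, keepdim=True)  # (batch, 1, dim)

        # Element-wise interaction of the broadcast global query with the keys,
        # then a linear layer instead of a key-value matrix multiplication.
        mixed = self.proj(q_global * k) + q
        return self.out(mixed)


if __name__ == "__main__":
    # Toy usage: 196 tokens of dimension 64 (e.g., a 14x14 feature map).
    x = torch.randn(2, 196, 64)
    y = EfficientAdditiveAttention(dim=64)(x)
    print(y.shape)  # torch.Size([2, 196, 64])
```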