SOFT: Softmax-free Transformer with Linear Complexity

Vision transformers (ViTs) have advanced the state of the art on various visual recognition tasks through patch-wise image tokenization followed by self-attention. However, self-attention modules incur quadratic complexity, in both computation and memory usage, with respect to the number of tokens. Various attempts at approximating the self-attention computation with linear complexity have been made in Natural Language Processing. However, an in-depth analysis in this work shows that they are either theoretically flawed or empirically ineffective for visual recognition. We further identify that their limitations are rooted in keeping the softmax self-attention during approximation. Specifically, conventional self-attention is computed by normalizing, via softmax, the scaled dot-product between token feature vectors, and keeping this softmax operation challenges any subsequent linearization effort. Based on this insight, a softmax-free transformer, or SOFT, is proposed for the first time. To remove softmax from self-attention, a Gaussian kernel function is used to replace the dot-product similarity without further normalization. This enables the full self-attention matrix to be approximated via a low-rank matrix decomposition. The robustness of the approximation is achieved by calculating its Moore-Penrose inverse using a Newton-Raphson method. Extensive experiments on ImageNet show that SOFT significantly improves the computational efficiency of existing ViT variants. Crucially, with linear complexity, much longer token sequences are permitted in SOFT, resulting in a superior trade-off between accuracy and complexity.
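The three ingredients named in the abstract (a Gaussian kernel in place of softmax, a low-rank decomposition, and a Newton-Raphson pseudoinverse) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. Several details are assumptions, since the abstract does not specify them: the function names, the `bandwidth` parameter, using the same token features as both queries and keys, a Nyström-style factorization through a small set of landmark tokens, and the step size used to initialize the iteration X <- 2X - XAX.

```python
import math

def gaussian_kernel(xs, ys, bandwidth=1.0):
    """Softmax-free similarity exp(-||x - y||^2 / (2 * bandwidth^2)); rows are not normalized."""
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2.0 * bandwidth ** 2))
             for y in ys] for x in xs]

def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    cols = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]

def newton_pinv(a, iters=30):
    """Approximate the Moore-Penrose inverse of `a` with the iteration X <- 2X - X A X."""
    n_rows, n_cols = len(a), len(a[0])
    norm_1 = max(sum(abs(a[i][j]) for i in range(n_rows)) for j in range(n_cols))
    norm_inf = max(sum(abs(v) for v in row) for row in a)
    # Assumed initialization: X0 = alpha * A^T with alpha = 1 / (||A||_1 * ||A||_inf),
    # which keeps alpha below 2 / sigma_max(A)^2 so the iteration converges.
    alpha = 1.0 / (norm_1 * norm_inf)
    x = [[alpha * a[j][i] for j in range(n_rows)] for i in range(n_cols)]
    for _ in range(iters):
        xax = matmul(matmul(x, a), x)
        x = [[2.0 * x[i][j] - xax[i][j] for j in range(n_rows)] for i in range(n_cols)]
    return x

def soft_attention(tokens, landmarks, values, bandwidth=1.0, iters=30):
    """Low-rank attention output K(tokens, landmarks) @ pinv(K(landmarks, landmarks))
    @ K(landmarks, tokens) @ values, evaluated right to left so that the
    n x n attention matrix is never formed."""
    left = gaussian_kernel(tokens, landmarks, bandwidth)      # n x m
    core = gaussian_kernel(landmarks, landmarks, bandwidth)   # m x m
    right = gaussian_kernel(landmarks, tokens, bandwidth)     # m x n
    summary = matmul(right, values)                           # m x d
    return matmul(left, matmul(newton_pinv(core, iters), summary))  # n x d

# Toy data (hypothetical): four 2-D tokens and their value vectors.
tokens = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]

# Exact softmax-free attention for reference (quadratic in the number of tokens).
exact = matmul(gaussian_kernel(tokens, tokens), values)
# Using every token as a landmark should reproduce the exact output.
full_rank = soft_attention(tokens, tokens, values)
# Using only two landmarks gives a cheaper, approximate output.
low_rank = soft_attention(tokens, tokens[:2], values)
```

With m landmarks, n tokens, and feature dimension d, the products above cost on the order of n·m·d plus an m×m pseudoinverse. For fixed m this is linear in n, which is what allows longer token sequences. How landmarks are chosen is left open here; taking the first two tokens is purely for illustration.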
