Linear attention mechanisms offer a way past the quadratic-complexity bottleneck that restricts the application of transformer models to vision tasks. We modify the ViT architecture to handle longer sequences by replacing its quadratic attention with efficient linear-complexity transformers such as Performer, Linformer and Nystr\"omformer, creating Vision X-formers (ViX). We show that ViX outperforms ViT in image classification while consuming fewer computing resources. We further show that replacing the linear embedding layer with convolutional layers in ViX increases performance further. Our tests on recent vision transformer models such as LeViT and the Compact Convolutional Transformer (CCT) show that replacing their attention with Nystr\"omformer or Performer reduces GPU and memory usage without degrading performance. Incorporating these changes can democratize transformers by making them accessible to those with limited data and computing resources.
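To illustrate the core idea of swapping quadratic attention for a linear-complexity variant, the sketch below shows kernelized linear attention in the style of Performer: softmax(QK^T)V is approximated by phi(Q)(phi(K)^T V), so the sequence length n never appears in an n-by-n matrix. This is a minimal single-head NumPy sketch with a simplified positive feature map, not the exact random-feature map Performer uses; all names here are illustrative.

```python
import numpy as np

def linear_attention(Q, K, V, feature_map=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Kernelized linear attention: phi(Q) (phi(K)^T V), O(n) in sequence length.

    A simplified positive feature map stands in for Performer's random features.
    """
    Qp, Kp = feature_map(Q), feature_map(K)       # (n, d) each
    KV = Kp.T @ V                                 # (d, d_v): keys/values aggregated once
    Z = Qp @ Kp.sum(axis=0, keepdims=True).T      # (n, 1): per-query normalizer
    return (Qp @ KV) / Z                          # (n, d_v)

# Hypothetical sizes: 14x14 = 196 patch tokens, head dimension 64.
rng = np.random.default_rng(0)
n, d = 196, 64
Q, K, V = rng.standard_normal((3, n, d))
out = linear_attention(Q, K, V)
print(out.shape)
```

Because Kp.T @ V is computed once and reused for every query, cost grows linearly with the number of tokens, which is what makes longer patch sequences affordable in ViX.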