Vision transformers (ViTs) have found only limited practical use in processing images, in spite of their state-of-the-art accuracy on certain benchmarks. The reasons for their limited use include their need for larger training datasets and more computational resources than convolutional neural networks (CNNs), owing to the quadratic complexity of their self-attention mechanism. We propose a linear attention-convolution hybrid architecture -- Convolutional X-formers for Vision (CXV) -- to overcome these limitations. We replace the quadratic attention with linear attention mechanisms, such as Performer, Nystr\"omformer, and Linear Transformer, to reduce the GPU usage. Convolutional sub-layers provide the inductive prior for image data, thereby eliminating the need for the class token and positional embeddings used by ViTs. We also propose a new training method that uses two different optimizers during different phases of training and show that it improves top-1 image classification accuracy across architectures. CXV outperforms token mixers (e.g. ConvMixer, FNet and MLP Mixer), transformer models (e.g. ViT, CCT, CvT and hybrid Xformers), and ResNets for image classification in scenarios with limited data and GPU resources (cores, RAM, power).
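To make the architectural idea concrete, below is a minimal PyTorch sketch of a CXV-style block, not the authors' implementation. It pairs a depthwise convolutional sub-layer (supplying the spatial inductive prior, so no class token or positional embeddings are needed) with a linear attention layer in the style of the Linear Transformer, which uses the feature map phi(x) = elu(x) + 1 so attention can be computed in O(N) rather than O(N^2) in the token count. The specific choices here (depthwise 3x3 convolution, head count, MLP expansion factor) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearAttention(nn.Module):
    """Kernelized attention (Linear Transformer style): the softmax is replaced
    by phi(x) = elu(x) + 1, so the key-value summary is computed first and the
    cost is linear, not quadratic, in sequence length."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.heads = heads
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                              # x: (batch, seq, dim)
        b, n, d = x.shape
        h = self.heads
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        # split heads: (batch, heads, seq, head_dim)
        q, k, v = (t.reshape(b, n, h, d // h).transpose(1, 2) for t in (q, k, v))
        q, k = F.elu(q) + 1, F.elu(k) + 1              # positive feature map
        kv = torch.einsum('bhnd,bhne->bhde', k, v)     # O(N) key-value summary
        z = 1 / (torch.einsum('bhnd,bhd->bhn', q, k.sum(dim=2)) + 1e-6)
        out = torch.einsum('bhnd,bhde,bhn->bhne', q, kv, z)
        return self.proj(out.transpose(1, 2).reshape(b, n, d))

class CXVBlock(nn.Module):
    """Hypothetical CXV-style block: a depthwise convolution provides the
    inductive prior for image data (replacing positional embeddings and the
    class token), followed by linear attention and an MLP, each residual."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.norm1 = nn.LayerNorm(dim)
        self.attn = LinearAttention(dim, heads)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(),
                                 nn.Linear(dim * 4, dim))

    def forward(self, x):                              # x: (batch, dim, H, W)
        x = x + self.conv(x)                           # convolutional prior
        b, d, hh, ww = x.shape
        t = x.flatten(2).transpose(1, 2)               # tokens: (batch, H*W, dim)
        t = t + self.attn(self.norm1(t))
        t = t + self.mlp(self.norm2(t))
        return t.transpose(1, 2).reshape(b, d, hh, ww)
```

Blocks like this can be stacked after a convolutional patch-embedding stem and followed by global average pooling for classification; swapping `LinearAttention` for a Performer or Nystr\"omformer layer yields the other CXV variants the abstract mentions.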
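The two-phase training method can likewise be sketched briefly. The abstract does not name the two optimizers or the switch point, so the choice of AdamW for the first phase, SGD with momentum for the second, and the epoch counts below are all illustrative assumptions.

```python
import torch

def train_two_phase(model, train_loader, loss_fn, epochs_phase1=100,
                    epochs_phase2=50, device='cuda'):
    """Sketch of a two-optimizer schedule: an adaptive optimizer for the first
    phase of training, then a switch to SGD for the remaining epochs. Optimizer
    choices and epoch counts are assumptions, not the paper's settings."""
    model.to(device)
    phases = [
        (torch.optim.AdamW(model.parameters(), lr=1e-3), epochs_phase1),
        (torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9), epochs_phase2),
    ]
    for optimizer, epochs in phases:
        for _ in range(epochs):
            for images, labels in train_loader:
                images, labels = images.to(device), labels.to(device)
                optimizer.zero_grad()
                loss = loss_fn(model(images), labels)
                loss.backward()
                optimizer.step()
    return model
```

The intuition behind such schedules is that adaptive methods tend to make rapid early progress, while plain SGD often generalizes better late in training; the abstract reports that the two-phase scheme improves top-1 accuracy across architectures.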