In recent computer vision research, the advent of the Vision Transformer (ViT) has rapidly revolutionized various architectural design efforts: ViT achieved state-of-the-art image classification performance using self-attention, originally developed for natural language processing, and MLP-Mixer achieved competitive performance using simple multi-layer perceptrons. In contrast, several studies have also suggested that carefully redesigned convolutional neural networks (CNNs) can achieve performance comparable to ViT without resorting to these new ideas. Against this background, there is growing interest in which inductive biases are suitable for computer vision. Here we propose Sequencer, a novel and competitive architectural alternative to ViT that provides a new perspective on these issues. Unlike ViTs, Sequencer models long-range dependencies using LSTMs rather than self-attention layers. We also propose a two-dimensional version of the Sequencer module, in which an LSTM is decomposed into vertical and horizontal LSTMs to enhance performance. Despite its simplicity, several experiments demonstrate that Sequencer performs impressively well: Sequencer2D-L, with 54M parameters, achieves 84.6\% top-1 accuracy trained on ImageNet-1K alone. Moreover, we show that it has good transferability and robust resolution adaptability when the input resolution is doubled.
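To make the vertical/horizontal decomposition concrete, below is a minimal PyTorch sketch of a two-dimensional LSTM mixing block of the kind the abstract describes: bidirectional LSTMs run along the columns and rows of a patch-token grid, and their outputs are fused by a pointwise linear layer. The class and parameter names (`BiLSTM2D`, `hidden`, `fc`) are illustrative choices, not the authors' reference implementation.

```python
# Hypothetical sketch of a vertical/horizontal BiLSTM mixing block,
# assuming patch tokens laid out as a (B, H, W, C) grid.
import torch
import torch.nn as nn


class BiLSTM2D(nn.Module):
    """Mixes spatial information with bidirectional LSTMs run along the
    vertical and horizontal axes of a feature map, instead of self-attention.
    Input and output shape: (B, H, W, C)."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        # One BiLSTM per axis; each sees a sequence of length H or W.
        self.v_lstm = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.h_lstm = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        # Fuse the concatenated vertical + horizontal outputs back to `dim`.
        self.fc = nn.Linear(4 * hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, h, w, c = x.shape
        # Vertical pass: treat each column as a sequence of H tokens.
        v = x.permute(0, 2, 1, 3).reshape(b * w, h, c)
        v, _ = self.v_lstm(v)
        v = v.reshape(b, w, h, -1).permute(0, 2, 1, 3)  # (B, H, W, 2*hidden)
        # Horizontal pass: treat each row as a sequence of W tokens.
        r = x.reshape(b * h, w, c)
        r, _ = self.h_lstm(r)
        r = r.reshape(b, h, w, -1)                       # (B, H, W, 2*hidden)
        # Concatenate both axes' features and project back to the token dim.
        return self.fc(torch.cat([v, r], dim=-1))


if __name__ == "__main__":
    block = BiLSTM2D(dim=192, hidden=48)
    tokens = torch.randn(2, 14, 14, 192)  # (B, H, W, C) patch tokens
    print(block(tokens).shape)            # torch.Size([2, 14, 14, 192])
```

Because each LSTM only processes sequences of length H or W (rather than H*W tokens, as self-attention over the full grid would), this decomposition keeps long-range mixing along both axes while avoiding quadratic cost in the number of patches.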