In recent computer vision research, the advent of the Vision Transformer (ViT) has rapidly revolutionized various architectural design efforts: ViT achieved state-of-the-art image classification performance using self-attention, borrowed from natural language processing, and MLP-Mixer achieved competitive performance using simple multi-layer perceptrons. In contrast, several studies have suggested that carefully redesigned convolutional neural networks (CNNs) can achieve performance comparable to ViT without resorting to these new ideas. Against this background, there is growing interest in which inductive biases are suitable for computer vision. Here we propose Sequencer, a novel and competitive architecture that offers an alternative to ViT and a new perspective on these issues. Unlike ViTs, Sequencer models long-range dependencies using LSTMs rather than self-attention layers. We also propose a two-dimensional version of the Sequencer module, in which a single LSTM is decomposed into vertical and horizontal LSTMs to enhance performance. Despite its simplicity, several experiments demonstrate that Sequencer performs impressively well: Sequencer2D-L, with 54M parameters, achieves 84.6% top-1 accuracy trained on ImageNet-1K alone. Moreover, we show that Sequencer transfers well and adapts robustly when the input resolution is doubled.
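To make the vertical/horizontal decomposition concrete, the following is a minimal PyTorch sketch of such a 2D LSTM block: bidirectional LSTMs run along the columns and rows of a patch-token grid, and their outputs are fused pointwise. This is an illustration under our own assumptions, not the authors' implementation; the class name `BiLSTM2D`, the hidden size, and the linear fusion layer are hypothetical choices.

```python
import torch
import torch.nn as nn

class BiLSTM2D(nn.Module):
    """Illustrative sketch (not the paper's code): mixes spatial information
    by running bidirectional LSTMs along the vertical and horizontal axes of
    a patch-token grid, then fusing the two outputs pointwise."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        # Each bidirectional LSTM outputs 2 * hidden features per token.
        self.v_lstm = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.h_lstm = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        # Pointwise fusion of the concatenated vertical/horizontal outputs.
        self.fc = nn.Linear(4 * hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) grid of patch embeddings.
        B, H, W, C = x.shape
        # Vertical branch: treat each column as a sequence of length H.
        v = x.permute(0, 2, 1, 3).reshape(B * W, H, C)
        v, _ = self.v_lstm(v)
        v = v.reshape(B, W, H, -1).permute(0, 2, 1, 3)  # (B, H, W, 2*hidden)
        # Horizontal branch: treat each row as a sequence of length W.
        h = x.reshape(B * H, W, C)
        h, _ = self.h_lstm(h)
        h = h.reshape(B, H, W, -1)                      # (B, H, W, 2*hidden)
        # Fuse both directions back to the embedding dimension.
        return self.fc(torch.cat([v, h], dim=-1))

# Smoke test on a 14x14 grid of 192-dim tokens (shapes are illustrative).
tokens = torch.randn(2, 14, 14, 192)
out = BiLSTM2D(dim=192, hidden=48)(tokens)
print(out.shape)  # torch.Size([2, 14, 14, 192])
```

One motivation for this decomposition: processing rows and columns separately shortens each LSTM's sequence from H·W tokens to at most max(H, W), while the fusion step still lets every token aggregate information from its entire row and column.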