Following its success in the language domain, the self-attention mechanism (Transformer) has been adopted in the vision domain and has recently achieved great success. Additionally, as another stream, the multi-layer perceptron (MLP) has also been explored in the vision domain. These architectures, beyond traditional CNNs, have been attracting attention recently, and many methods have been proposed. As an architecture that combines parameter efficiency and performance with locality and hierarchy in image recognition, we propose gSwin, which merges two streams: the Swin Transformer and (multi-head) gMLP. We show that gSwin achieves better accuracy than the Swin Transformer on three vision tasks — image classification, object detection, and semantic segmentation — with a smaller model size.