Transformers have become one of the most important architectural innovations in deep learning and have enabled many breakthroughs over the past few years. Here we propose a simple attention-free network architecture, gMLP, based solely on MLPs with gating, and show that it can perform as well as Transformers in key language and vision applications. Our comparisons show that self-attention is not critical for Vision Transformers, as gMLP can achieve the same accuracy. For BERT, our model achieves parity with Transformers on pretraining perplexity and is better on some downstream tasks. On finetuning tasks where gMLP performs worse, making the gMLP model substantially larger can close the gap with Transformers. In general, our experiments show that gMLP can scale as well as Transformers over increased data and compute.
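The abstract only says the model is "based solely on MLPs with gating" and does not spell out the block layout. Below is a minimal sketch of one such gated MLP block in that spirit: channels are split in half and one half is modulated by a learned projection across the token dimension. The class names (`SpatialGatingUnit`, `GMLPBlock`), the expansion ratio, and the near-identity initialization are illustrative assumptions, not details stated in this section.

```python
import torch
import torch.nn as nn

class SpatialGatingUnit(nn.Module):
    """Splits channels in half and gates one half with a learned
    projection along the sequence (token) dimension."""
    def __init__(self, dim_ffn: int, seq_len: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim_ffn // 2)
        # Linear layer acting across tokens rather than channels.
        self.spatial_proj = nn.Linear(seq_len, seq_len)
        # Near-identity initialization (assumed) keeps early training stable.
        nn.init.zeros_(self.spatial_proj.weight)
        nn.init.ones_(self.spatial_proj.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        u, v = x.chunk(2, dim=-1)               # (batch, seq, dim_ffn/2) each
        v = self.norm(v)
        v = self.spatial_proj(v.transpose(1, 2)).transpose(1, 2)
        return u * v                            # element-wise gating

class GMLPBlock(nn.Module):
    """One attention-free block: a channel MLP with a spatial gate inside."""
    def __init__(self, dim: int, dim_ffn: int, seq_len: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.proj_in = nn.Linear(dim, dim_ffn)
        self.sgu = SpatialGatingUnit(dim_ffn, seq_len)
        self.proj_out = nn.Linear(dim_ffn // 2, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shortcut = x
        x = self.norm(x)
        x = torch.nn.functional.gelu(self.proj_in(x))
        x = self.sgu(x)
        x = self.proj_out(x)
        return x + shortcut                     # residual connection

# Usage: a batch of 8 sequences, 128 tokens, model width 256.
block = GMLPBlock(dim=256, dim_ffn=1024, seq_len=128)
out = block(torch.randn(8, 128, 256))
print(out.shape)  # torch.Size([8, 128, 256])
```

Because the only cross-token interaction is the gating projection, such a block has no self-attention at all, which is the point the abstract makes when comparing gMLP with Transformers.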