Recent studies indicate that inductive bias (IB) can improve Vision Transformer (ViT) performance. However, these studies also introduce a pyramid structure to offset the additional FLOPs and parameters that IB incurs. This structure breaks the architectural unification of computer vision and natural language processing (NLP) and complicates the model. We study an NLP model called LSRA, which introduces IB under a pyramid-free structure, and analyze why it outperforms ViT. We find that introducing IB increases the share of high-frequency information in each layer, giving "attention" to more information. As a result, the heads attend to more diverse information and deliver better performance. To further explore the potential of transformers, we propose EIT, which Efficiently introduces IB to ViT through a novel decreasing convolutional structure while remaining pyramid-free. EIT achieves performance competitive with state-of-the-art (SOTA) methods on ImageNet-1K and achieves SOTA performance among same-scale models with a pyramid-free structure.
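The abstract names a "decreasing convolutional structure" but does not spell it out; the sketch below is a minimal illustration, not the paper's exact EIT design. It assumes a convolutional stem whose kernel sizes shrink layer by layer (injecting convolutional IB) feeding a single-resolution, pyramid-free Transformer encoder. The names `DecreasingConvStem` and `PyramidFreeViT` and all hyperparameters are hypothetical.

```python
# Hypothetical sketch of introducing convolutional IB into a pyramid-free ViT.
# Assumption: "decreasing" refers to kernel sizes that shrink across stem layers;
# the actual EIT architecture may differ.
import torch
import torch.nn as nn


class DecreasingConvStem(nn.Module):
    """Stack of stride-2 convolutions with decreasing kernel sizes (assumed 7, 5, 3),
    turning a 224x224 image into a 14x14 grid of tokens."""

    def __init__(self, in_chans=3, embed_dim=192):
        super().__init__()
        chans = [in_chans, embed_dim // 4, embed_dim // 2, embed_dim]
        kernels = [7, 5, 3]  # decreasing kernel sizes
        layers = []
        for i, k in enumerate(kernels):
            layers += [
                nn.Conv2d(chans[i], chans[i + 1], kernel_size=k, stride=2, padding=k // 2),
                nn.BatchNorm2d(chans[i + 1]),
                nn.GELU(),
            ]
        # final stride-2 projection to reach the usual 16x16 overall patch stride
        layers.append(nn.Conv2d(chans[-1], embed_dim, kernel_size=2, stride=2))
        self.stem = nn.Sequential(*layers)

    def forward(self, x):
        x = self.stem(x)                      # (B, embed_dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)   # (B, N, embed_dim) token sequence


class PyramidFreeViT(nn.Module):
    """Plain ViT encoder: every block sees the same token resolution (no pyramid)."""

    def __init__(self, embed_dim=192, depth=12, num_heads=3, num_classes=1000):
        super().__init__()
        self.patch_embed = DecreasingConvStem(embed_dim=embed_dim)
        block = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=4 * embed_dim,
            activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=depth)
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        tokens = self.patch_embed(x)          # convolutional IB, single-scale tokens
        tokens = self.encoder(tokens)
        return self.head(self.norm(tokens.mean(dim=1)))  # mean-pooled classification


if __name__ == "__main__":
    model = PyramidFreeViT()
    logits = model(torch.randn(2, 3, 224, 224))
    print(logits.shape)  # torch.Size([2, 1000])
```

Keeping every block at the same token resolution is what distinguishes this pyramid-free layout from pyramid ViTs, which downsample tokens between stages.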