Current research indicates that introducing inductive bias (IB) can improve the performance of the Vision Transformer (ViT). However, these methods simultaneously adopt a pyramid structure to offset the additional FLOPs and parameters that IB incurs. Such a structure breaks the architectural unification of computer vision and natural language processing (NLP) and is also unsuitable for pixel-level tasks. We study an NLP model called LSRA, which introduces IB under a pyramid-free structure with fewer FLOPs and parameters. Analyzing why it outperforms ViT, we find that introducing IB increases the share of high-frequency data (HFD) in each layer, giving the attention mechanism more comprehensive information. As a result, each head attends to more diverse information, reflected in an increased diversity of head-attention distances (Head Diversity). However, the way LSRA introduces IB is inefficient. To further raise the HFD share, increase Head Diversity, and explore the potential of transformers, we propose EIT, which efficiently introduces IB into ViT through a novel decreasing convolutional structure while remaining pyramid-free. On four small-scale datasets, EIT improves accuracy by 13% on average with fewer parameters and FLOPs than ViT. On ImageNet-1K, EIT achieves 70%, 78%, 81% and 82% Top-1 accuracy with 3.5M, 8.9M, 16M and 25M parameters, respectively, which is competitive with representative state-of-the-art (SOTA) methods. In particular, EIT achieves SOTA performance among models with a pyramid-free structure. Finally, ablation studies show that EIT does not require position embedding, offering the possibility of simplified adaptation to more visual tasks without redesigning the embedding.
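To make the two architectural ideas in the abstract concrete, the sketch below shows one plausible reading of a "decreasing convolutional structure" used to inject IB before a pyramid-free transformer: a convolutional stem whose kernel sizes shrink stage by stage, feeding a backbone whose token count and width stay constant across layers, with no position embedding. All module names, channel widths, kernel sizes, and strides are illustrative assumptions, not the paper's exact EIT configuration.

```python
# Hedged sketch, assuming PyTorch; shapes and widths are illustrative.
import torch
import torch.nn as nn

class DecreasingConvStem(nn.Module):
    """Convolutional patch embedding with decreasing kernel sizes
    (7 -> 5 -> 3), injecting locality (an inductive bias) before
    the transformer. The exact EIT design may differ."""
    def __init__(self, in_ch=3, embed_dim=192):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, 48, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm2d(48), nn.GELU(),
            nn.Conv2d(48, 96, kernel_size=5, stride=2, padding=2),
            nn.BatchNorm2d(96), nn.GELU(),
            nn.Conv2d(96, embed_dim, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(embed_dim), nn.GELU(),
        )

    def forward(self, x):
        x = self.stem(x)                     # (B, C, H/8, W/8)
        return x.flatten(2).transpose(1, 2)  # (B, N, C) token sequence

class PyramidFreeViT(nn.Module):
    """Pyramid-free backbone: no stage-wise downsampling, so token
    count and embedding width are constant across all layers. No
    position embedding is added, mirroring the ablation finding that
    the convolutional stem already encodes spatial structure."""
    def __init__(self, embed_dim=192, depth=12, heads=3, num_classes=1000):
        super().__init__()
        self.embed = DecreasingConvStem(embed_dim=embed_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=heads, dim_feedforward=embed_dim * 4,
            batch_first=True, activation="gelu")
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        tokens = self.encoder(self.embed(x))
        return self.head(tokens.mean(dim=1))  # global average pooling

# Usage example:
# logits = PyramidFreeViT()(torch.randn(2, 3, 224, 224))  # (2, 1000)
```

Keeping the sequence length fixed (rather than halving resolution per stage as in pyramid designs) is what preserves the NLP-style, resolution-agnostic interface the abstract argues for.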