We address the inefficiency of vision transformers caused by the high computational and space complexity of Multi-Head Self-Attention (MHSA). To this end, we propose Hierarchical MHSA (H-MHSA), whose representation is computed in a hierarchical manner. Specifically, H-MHSA first learns feature relationships within small grids by viewing image patches as tokens. The small grids are then merged into larger ones, within which feature relationships are learned by viewing each small grid from the preceding step as a token. This process is iterated to gradually reduce the number of tokens. The H-MHSA module is readily pluggable into any CNN architecture and amenable to training via backpropagation. We call the resulting backbone TransCNN; it essentially inherits the advantages of both transformers and CNNs. Experiments demonstrate that TransCNN achieves state-of-the-art accuracy for image recognition. Code and pretrained models are available at https://github.com/yun-liu/TransCNN. This technical report will be updated as more experiments are added.
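The token-reduction argument can be illustrated with a small counting sketch. It is not the authors' implementation: it only counts query-key pairs, and it assumes a square feature map of side `side`, square grids of side `grid`, and that each grid collapses into exactly one token before the next step. The function names and these parameters are hypothetical and chosen for illustration.

```python
def global_pairs(side):
    """Query-key pairs for plain self-attention over a side x side token map."""
    n = side * side
    return n * n


def hierarchical_pairs(side, grid):
    """Per-level (tokens, query-key pairs) for grid-wise hierarchical attention.

    Assumption (not from the source): every grid x grid block is merged
    into a single token after attention has been computed inside it.
    """
    levels = []
    while side > grid:
        if side % grid != 0:
            raise ValueError("side must be divisible by grid at every level")
        num_grids = (side // grid) ** 2
        tokens_per_grid = grid * grid
        # Attention is restricted to tokens that share a grid.
        levels.append((side * side, num_grids * tokens_per_grid ** 2))
        side //= grid  # each grid becomes one token at the next level
    # The remaining tokens fit in one grid, so they all attend to each other.
    remaining = side * side
    levels.append((remaining, remaining ** 2))
    return levels


if __name__ == "__main__":
    side, grid = 64, 8
    levels = hierarchical_pairs(side, grid)
    for depth, (tokens, pairs) in enumerate(levels, start=1):
        print(f"level {depth}: {tokens} tokens, {pairs} pairs")
    total = sum(pairs for _, pairs in levels)
    print(f"hierarchical total: {total}, global: {global_pairs(side)}")
```

Under these assumptions the pair count per level grows linearly with the number of grids rather than quadratically with the number of tokens, which is the source of the savings the abstract describes. Real costs also depend on the channel width, the number of heads, and how grids are merged, none of which this count models.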