Recently, the vision transformer (ViT) has made breakthroughs in image recognition. Its multi-head self-attention (MSA) mechanism can extract discriminative token information from different image patches to improve image classification accuracy. However, the classification token in the deep layers tends to ignore local features between layers. In addition, the embedding layer splits the image into fixed-size patches, which inevitably introduces additional image noise into the network input. To this end, we study a data augmentation vision transformer (DAVT) and propose an attention-cropping data augmentation method, which uses attention weights as a guide to crop images and improves the network's ability to learn critical features. We also propose a hierarchical attention selection (HAS) method, which improves the network's ability to learn discriminative tokens across levels by filtering and fusing tokens between levels. Experimental results show that the accuracy of this method on two widely used datasets, CUB-200-2011 and Stanford Dogs, is better than that of existing mainstream methods, and is 1.4\% and 1.6\% higher than that of the original ViT, respectively.
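The idea of cropping an image under the guidance of attention weights can be illustrated with a minimal sketch. The abstract does not specify how the crop region is chosen, so everything below is an assumption made for illustration: the function name `attention_crop`, the max-relative `threshold`, and the choice of the bounding box of high-attention patches are hypothetical and not taken from the paper.

```python
import numpy as np


def attention_crop(image, attn, patch_size, threshold=0.5):
    """Crop `image` to the bounding box of its high-attention patches.

    image:      (H, W, C) array.
    attn:       (H // patch_size, W // patch_size) array of non-negative
                attention weights, one per patch.
    threshold:  fraction of the maximum weight, in (0, 1], that a patch must
                reach to be kept (an assumed rule, not from the source).
    """
    # Keep patches whose weight is at least `threshold` times the maximum.
    mask = attn >= threshold * attn.max()

    # Indices of patch rows and columns that contain a kept patch.
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))

    # Convert the patch-level bounding box to pixel coordinates.
    top, bottom = rows[0] * patch_size, (rows[-1] + 1) * patch_size
    left, right = cols[0] * patch_size, (cols[-1] + 1) * patch_size
    return image[top:bottom, left:right]


# Toy example: a 64x64 image with 16x16 patches and a 2x2 block of
# high-attention patches.
image = np.zeros((64, 64, 3), dtype=np.uint8)
attn = np.zeros((4, 4))
attn[1:3, 2:4] = 1.0
crop = attention_crop(image, attn, patch_size=16)
print(crop.shape)  # (32, 32, 3)
```

In a ViT, `attn` could plausibly be obtained from the classification token's attention over the patch tokens, reshaped to the patch grid; the cropped region would then be resized to the network's input resolution and used as an additional training sample.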