Recently, the vision transformer (ViT) has made breakthroughs in image recognition. Its multi-head self-attention mechanism (MSA) can extract discriminative token information from different image patches to improve classification accuracy. However, the classification tokens in its deep layers tend to ignore local features between layers. In addition, the embedding layer splits the input image into fixed-size patches, which inevitably introduces additional image noise into the network. To address these issues, this paper studies a data augmentation vision transformer (DAVT) and proposes an attention-cropping data augmentation method, which uses attention weights as a guide to crop images and improves the network's ability to learn critical features. Secondly, this paper proposes a hierarchical attention selection (HAS) method, which improves the learning of discriminative tokens between layers by filtering and fusing tokens across layers. Experimental results show that the accuracy of this method on two standard datasets, CUB-200-2011 and Stanford Dogs, is better than that of existing mainstream methods, and is 1.4\% and 1.6\% higher than the original ViT, respectively.
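The two components named above can be made concrete with a short sketch. Below is a minimal PyTorch-style illustration of attention-guided cropping and hierarchical attention selection as described in the abstract; the function names, tensor shapes, and hyperparameters (threshold, k) are illustrative assumptions, not the authors' released implementation.

    # Minimal sketch of the two mechanisms described above.
    # All names and defaults are assumptions for illustration only.
    import torch
    import torch.nn.functional as F

    def attention_crop(images, attn_weights, threshold=0.5):
        """Crop each image to the region highlighted by ViT attention.

        images:       (B, C, H, W) input batch
        attn_weights: (B, heads, N+1, N+1) self-attention of the last block,
                      where N is the number of patches (class token excluded)
        threshold:    fraction of the per-image max attention kept in the mask
        """
        B, C, H, W = images.shape
        # Attention from the class token to every patch, averaged over heads.
        cls_attn = attn_weights[:, :, 0, 1:].mean(dim=1)          # (B, N)
        side = int(cls_attn.shape[1] ** 0.5)                      # patches per side
        attn_map = cls_attn.reshape(B, 1, side, side)
        # Upsample the patch-level attention map to pixel resolution.
        attn_map = F.interpolate(attn_map, size=(H, W), mode="bilinear",
                                 align_corners=False).squeeze(1)  # (B, H, W)
        crops = []
        for b in range(B):
            mask = attn_map[b] >= threshold * attn_map[b].max()
            ys, xs = torch.where(mask)
            y0, y1 = ys.min().item(), ys.max().item()
            x0, x1 = xs.min().item(), xs.max().item()
            # Bounding box of the high-attention region, resized back to (H, W).
            crop = images[b:b+1, :, y0:y1+1, x0:x1+1]
            crops.append(F.interpolate(crop, size=(H, W), mode="bilinear",
                                       align_corners=False))
        return torch.cat(crops, dim=0)

    def hierarchical_attention_selection(layer_tokens, layer_attns, k=12):
        """Filter the k most-attended patch tokens per layer and fuse them.

        layer_tokens: list of (B, N+1, D) token sequences, one per layer
        layer_attns:  list of (B, heads, N+1, N+1) matching attention maps
        Returns a (B, L*k + 1, D) sequence: the final class token plus the
        selected tokens from each of the L layers.
        """
        selected = []
        for tokens, attn in zip(layer_tokens, layer_attns):
            cls_attn = attn[:, :, 0, 1:].mean(dim=1)              # (B, N)
            topk = cls_attn.topk(k, dim=1).indices                # (B, k)
            idx = topk.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
            selected.append(tokens[:, 1:, :].gather(1, idx))      # (B, k, D)
        cls_token = layer_tokens[-1][:, :1, :]                    # (B, 1, D)
        return torch.cat([cls_token] + selected, dim=1)

The cropped images would be fed through the backbone a second time as augmented training samples, while the fused token sequence would feed the final classification head; both choices are inferred from the abstract rather than taken from the paper's code.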