Fine-grained visual classification (FGVC), which aims at recognizing objects from subcategories, is a very challenging task due to the inherently subtle inter-class differences. Most existing works tackle this problem by reusing the backbone network to extract features of detected discriminative regions. However, this strategy inevitably complicates the pipeline and pushes the proposed regions to contain most parts of the objects, thus failing to locate the truly important parts. Recently, the vision transformer (ViT) has shown strong performance on conventional classification tasks. The self-attention mechanism of the transformer links every patch token to the classification token. In this work, we first evaluate the effectiveness of the ViT framework in the fine-grained recognition setting. Then, motivated by the observation that the strength of an attention link can be intuitively interpreted as an indicator of token importance, we propose a novel Part Selection Module, applicable to most transformer architectures, which integrates all raw attention weights of the transformer into an attention map that guides the network to effectively and accurately select discriminative image patches and compute their relations. A contrastive loss is applied to enlarge the distance between the feature representations of confusing classes. We name the augmented transformer-based model TransFG and demonstrate its value by conducting experiments on five popular fine-grained benchmarks, where we achieve state-of-the-art performance. Qualitative results are presented for a better understanding of our model.
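To make the two key ingredients concrete, the following is a minimal PyTorch sketch of how a part-selection step that integrates raw attention weights across layers and a margin-based contrastive loss could be implemented. The function names `select_parts` and `contrastive_loss`, the attention-rollout-style integration, and the margin value are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def select_parts(attn_weights, hidden_states):
    """Sketch of a Part Selection Module (assumed interface).

    attn_weights: list of per-layer attention tensors, each of shape
                  (batch, heads, seq_len, seq_len), seq_len = 1 + num_patches.
    hidden_states: (batch, seq_len, dim) tokens entering the last layer.
    Returns the [CLS] token concatenated with one selected patch token per head.
    """
    # Integrate raw attention across all layers (attention-rollout style):
    # add the residual connection, renormalize, then multiply layer by layer.
    joint = None
    for a in attn_weights:
        a = a + torch.eye(a.size(-1), device=a.device)  # residual link
        a = a / a.sum(dim=-1, keepdim=True)             # renormalize rows
        joint = a if joint is None else torch.matmul(a, joint)

    # Importance of each patch = integrated attention from [CLS] (row 0) to it.
    cls_to_patch = joint[:, :, 0, 1:]                   # (batch, heads, patches)
    idx = cls_to_patch.argmax(dim=-1)                   # top patch index per head

    batch = hidden_states.size(0)
    cls_tok = hidden_states[:, :1]                      # keep the [CLS] token
    picked = torch.stack(
        [hidden_states[b, 1 + idx[b]] for b in range(batch)], dim=0
    )                                                   # (batch, heads, dim)
    # Only discriminative patches (plus [CLS]) are passed to the final layer,
    # which then computes the relations among the selected parts.
    return torch.cat([cls_tok, picked], dim=1)

def contrastive_loss(feats, labels, margin=0.4):
    """Sketch of a contrastive loss on normalized [CLS] features: pull same-class
    pairs together, push different-class pairs below a similarity margin."""
    feats = F.normalize(feats, dim=-1)
    sim = feats @ feats.t()                             # pairwise cosine similarity
    same = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()
    pos = (1.0 - sim) * same                            # same class: maximize similarity
    neg = torch.clamp(sim - margin, min=0.0) * (1.0 - same)  # different class: penalize
    n = feats.size(0)
    return (pos + neg).sum() / (n * n)
```

In this sketch, selecting one patch per attention head keeps the final transformer layer focused on a small set of discriminative regions instead of all patch tokens, which mirrors the stated goal of locating truly important parts without region proposals.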