Fine-grained visual classification (FGVC), which aims at recognizing objects from subcategories, is a very challenging task due to the inherently subtle inter-class differences. Recent works mainly tackle this problem by focusing on how to locate the most discriminative image regions and relying on them to improve the capability of networks to capture subtle variances. Most of these works achieve this by using an RPN module to propose bounding boxes and re-using the backbone network to extract features from the selected boxes. Recently, the vision transformer (ViT) has shown strong performance on the traditional classification task. The self-attention mechanism of the transformer links every patch token to the classification token, and the strength of an attention link can intuitively be regarded as an indicator of a token's importance. In this work, we propose a novel transformer-based framework, TransFG, in which we integrate all raw attention weights of the transformer into an attention map that guides the network to effectively and accurately select discriminative image patches and compute their relations. A duplicate loss is introduced to encourage multiple attention heads to focus on different regions. In addition, a contrastive loss is applied to further enlarge the distance between feature representations of similar sub-classes. We demonstrate the value of TransFG through experiments on five popular fine-grained benchmarks, CUB-200-2011, Stanford Cars, Stanford Dogs, NABirds, and iNat2017, where we achieve state-of-the-art performance. Qualitative results are presented for a better understanding of our model.
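The abstract compresses three mechanisms into a few sentences: integrating the per-layer raw attention weights into a single attention map, using that map to select discriminative patch tokens, and adding duplicate and contrastive losses on top of the classification objective. The sketch below is a minimal PyTorch-style illustration of these ideas, not the authors' released code: the function names, the head-averaging in the rollout, and the margin value are assumptions made for illustration (the attention-map integration follows the general attention-rollout idea of multiplying re-normalized per-layer attention matrices).

```python
import torch
import torch.nn.functional as F

def attention_rollout(attn_weights):
    """Integrate raw attention weights across all transformer layers.

    attn_weights: list of per-layer tensors of shape (B, H, N, N).
    Heads are averaged, a residual identity is added, rows are
    re-normalized, and the matrices are multiplied layer by layer.
    """
    B, _, N, _ = attn_weights[0].shape
    rollout = torch.eye(N, device=attn_weights[0].device).expand(B, N, N)
    for attn in attn_weights:
        a = attn.mean(dim=1)                      # average over heads: (B, N, N)
        a = a + torch.eye(N, device=a.device)     # account for residual connections
        a = a / a.sum(dim=-1, keepdim=True)       # re-normalize rows
        rollout = torch.bmm(a, rollout)           # accumulate across layers
    return rollout

def select_discriminative_tokens(tokens, rollout, k):
    """Keep the k patch tokens most attended by the [CLS] token.

    tokens: (B, N, D) hidden states; rollout: (B, N, N); index 0 is [CLS].
    """
    cls_attn = rollout[:, 0, 1:]                  # [CLS] -> patch attention: (B, N-1)
    idx = cls_attn.topk(k, dim=-1).indices + 1    # +1 to skip the [CLS] slot
    batch_idx = torch.arange(tokens.size(0)).unsqueeze(-1)
    return tokens[batch_idx, idx]                 # (B, k, D)

def duplicate_loss(attn):
    """One plausible form of the duplicate loss: penalize overlap between
    the [CLS]-to-patch attention of different heads so that heads are
    pushed toward different regions. attn: (B, H, N, N), one layer."""
    p = F.normalize(attn[:, :, 0, 1:], dim=-1)    # per-head [CLS] attention: (B, H, N-1)
    overlap = torch.einsum('bhn,bgn->bhg', p, p)  # head-pair similarity: (B, H, H)
    H = p.size(1)
    off_diag = overlap.sum(dim=(1, 2)) - overlap.diagonal(dim1=1, dim2=2).sum(-1)
    return (off_diag / (H * (H - 1))).mean()

def contrastive_loss(features, labels, margin=0.4):
    """Pull same-class features together and push different-class pairs
    apart until their cosine similarity drops below `margin`."""
    f = F.normalize(features, dim=-1)
    sim = f @ f.t()                               # pairwise cosine similarity
    same = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()
    pos = (1.0 - sim) * same                      # attract same-class pairs
    neg = F.relu(sim - margin) * (1.0 - same)     # repel confusable pairs
    n = labels.numel()
    return (pos.sum() + neg.sum()) / (n * n)
```

In a full model, the tokens returned by `select_discriminative_tokens` would be fed, together with the [CLS] token, into a final transformer layer whose output drives the classification head, and the two auxiliary losses would be added to the standard cross-entropy objective; that wiring is omitted here.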


