Fine-grained visual classification (FGVC), which aims at recognizing objects from subcategories, is a very challenging task due to the inherently subtle inter-class differences. Recent works mainly tackle this problem by focusing on how to locate the most discriminative image regions and relying on them to improve the capability of networks to capture subtle variances. Most of these works achieve this by using an RPN module to propose bounding boxes and re-using the backbone network to extract features from the selected boxes. Recently, the vision transformer (ViT) has shown strong performance on the traditional classification task. The self-attention mechanism of the transformer links every patch token to the classification token, and the strength of an attention link can intuitively be regarded as an indicator of a token's importance. In this work, we propose a novel transformer-based framework, TransFG, in which we integrate all raw attention weights of the transformer into an attention map that guides the network to effectively and accurately select discriminative image patches and compute their relations. A duplicate loss is introduced to encourage multiple attention heads to focus on different regions. In addition, a contrastive loss is applied to further enlarge the distance between feature representations of similar sub-classes. We demonstrate the value of TransFG through experiments on five popular fine-grained benchmarks, CUB-200-2011, Stanford Cars, Stanford Dogs, NABirds, and iNat2017, where we achieve state-of-the-art performance. Qualitative results are presented for a better understanding of our model.
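The abstract compresses three mechanisms into a few sentences: integrating the per-layer raw attention weights into a single attention map, using that map to select discriminative patch tokens, and adding duplicate and contrastive losses on top of the classification objective. The sketch below is a minimal PyTorch-style illustration of these ideas, not the authors' released code: the function names, the head-averaging in the rollout, and the margin value are assumptions made for illustration (the attention-map integration follows the general attention-rollout idea of multiplying re-normalized per-layer attention matrices).

```python
import torch
import torch.nn.functional as F

def attention_rollout(attn_weights):
    """Integrate raw attention weights across all transformer layers.

    attn_weights: list of per-layer tensors of shape (B, H, N, N).
    Heads are averaged, a residual identity is added, rows are
    re-normalized, and the matrices are multiplied layer by layer.
    """
    B, _, N, _ = attn_weights[0].shape
    rollout = torch.eye(N, device=attn_weights[0].device).expand(B, N, N)
    for attn in attn_weights:
        a = attn.mean(dim=1)                      # average over heads: (B, N, N)
        a = a + torch.eye(N, device=a.device)     # account for residual connections
        a = a / a.sum(dim=-1, keepdim=True)       # re-normalize rows
        rollout = torch.bmm(a, rollout)           # accumulate across layers
    return rollout

def select_discriminative_tokens(tokens, rollout, k):
    """Keep the k patch tokens most attended by the [CLS] token.

    tokens: (B, N, D) hidden states; rollout: (B, N, N); index 0 is [CLS].
    """
    cls_attn = rollout[:, 0, 1:]                  # [CLS] -> patch attention: (B, N-1)
    idx = cls_attn.topk(k, dim=-1).indices + 1    # +1 to skip the [CLS] slot
    batch_idx = torch.arange(tokens.size(0)).unsqueeze(-1)
    return tokens[batch_idx, idx]                 # (B, k, D)

def duplicate_loss(attn):
    """One plausible form of the duplicate loss: penalize overlap between
    the [CLS]-to-patch attention of different heads so that heads are
    pushed toward different regions. attn: (B, H, N, N), one layer."""
    p = F.normalize(attn[:, :, 0, 1:], dim=-1)    # per-head [CLS] attention: (B, H, N-1)
    overlap = torch.einsum('bhn,bgn->bhg', p, p)  # head-pair similarity: (B, H, H)
    H = p.size(1)
    off_diag = overlap.sum(dim=(1, 2)) - overlap.diagonal(dim1=1, dim2=2).sum(-1)
    return (off_diag / (H * (H - 1))).mean()

def contrastive_loss(features, labels, margin=0.4):
    """Pull same-class features together and push different-class pairs
    apart until their cosine similarity drops below `margin`."""
    f = F.normalize(features, dim=-1)
    sim = f @ f.t()                               # pairwise cosine similarity
    same = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()
    pos = (1.0 - sim) * same                      # attract same-class pairs
    neg = F.relu(sim - margin) * (1.0 - same)     # repel confusable pairs
    n = labels.numel()
    return (pos.sum() + neg.sum()) / (n * n)
```

In a full model, the tokens returned by `select_discriminative_tokens` would be fed, together with the [CLS] token, into a final transformer layer whose output drives the classification head, and the two auxiliary losses would be added to the standard cross-entropy objective; that wiring is omitted here.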


