Fine-grained image classification aims to recognize hundreds of subcategories belonging to the same basic-level category, which is highly challenging due to the subtle visual distinctions among similar subcategories. Most existing methods learn part detectors to discover discriminative regions for better performance. However, not all localized parts are beneficial or indispensable for classification, and setting the number of part detectors relies heavily on prior knowledge as well as experimental results. When we describe the object in an image with natural language, we focus only on the pivotal characteristics and rarely pay attention to common characteristics or the background. This is an involuntary transfer from human visual attention to textual attention, which means that textual attention tells us how many parts are discriminative and significant, and which ones they are. Therefore, the textual attention of natural language descriptions can help us discover visual attention in images. Inspired by this, we propose a visual-textual attention driven fine-grained representation learning (VTA) approach, whose main contributions are: (1) Fine-grained visual-textual pattern mining is devoted to discovering discriminative visual-textual pairwise information to boost classification by jointly modeling vision and text with generative adversarial networks (GANs), which automatically and adaptively discovers discriminative parts. (2) Visual-textual representation learning jointly combines visual and textual information, preserving the intra-modality and inter-modality information to generate a complementary fine-grained representation and further improve classification performance. Experiments on two widely used datasets demonstrate the effectiveness of our VTA approach, which achieves the best classification accuracy.
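As a minimal sketch of contribution (2), the joint objective can be viewed as combining intra-modality terms with an inter-modality term; the notation and weighting below are our illustrative assumptions, not the paper's exact formulation:
$$
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{cls}}(v) \;+\; \mathcal{L}_{\mathrm{cls}}(t) \;+\; \lambda\, \mathcal{L}_{\mathrm{align}}(v, t),
$$
where $v$ and $t$ denote the visual and textual representations, the first two classification terms preserve intra-modality discriminability, and $\mathcal{L}_{\mathrm{align}}$ (e.g., a cross-modal matching or ranking loss) preserves inter-modality correspondence so that the two modalities complement each other in the final fine-grained representation.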