Fine-grained image classification aims to recognize hundreds of subcategories belonging to the same basic-level category, which is highly challenging due to the subtle visual distinctions among similar subcategories. Most existing methods learn part detectors to discover discriminative regions for better performance. However, not all localized parts are beneficial or indispensable for classification, and setting the number of part detectors relies heavily on prior knowledge as well as experimental results. When we describe the object in an image with natural language, we focus only on the pivotal characteristics and rarely pay attention to common characteristics or the background. This is an involuntary transfer from human visual attention to textual attention, which means that textual attention tells us how many parts are discriminative and significant, and which ones they are. Therefore, the textual attention of natural language descriptions can help us discover visual attention in images. Inspired by this, we propose a visual-textual attention driven fine-grained representation learning (VTA) approach, whose main contributions are: (1) Fine-grained visual-textual pattern mining is devoted to discovering discriminative visual-textual pairwise information to boost classification by jointly modeling vision and text with generative adversarial networks (GANs), which automatically and adaptively discovers discriminative parts. (2) Visual-textual representation learning jointly combines visual and textual information, preserving the intra-modality and inter-modality information to generate a complementary fine-grained representation and further improve classification performance. Experiments on two widely used datasets demonstrate the effectiveness of our VTA approach, which achieves the best classification accuracy.
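As a minimal sketch of contribution (2), the joint objective can be viewed as combining intra-modality terms with an inter-modality term; the notation and weighting below are our illustrative assumptions, not the paper's exact formulation:
$$
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{cls}}(v) \;+\; \mathcal{L}_{\mathrm{cls}}(t) \;+\; \lambda\, \mathcal{L}_{\mathrm{align}}(v, t),
$$
where $v$ and $t$ denote the visual and textual representations, the first two classification terms preserve intra-modality discriminability, and $\mathcal{L}_{\mathrm{align}}$ (e.g., a cross-modal matching or ranking loss) preserves inter-modality correspondence so that the two modalities complement each other in the final fine-grained representation.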