Fine-grained visual categorization (FGVC) aims to discriminate between similar subcategories; its main challenges are large intra-class variation and subtle inter-class differences. Existing FGVC methods usually select discriminative regions found by a trained model, which tends to neglect other potentially discriminative information. On the other hand, the massive interactions among the sequence of image patches in ViT cause the resulting class-token to contain a large amount of redundant information, which may also impact FGVC performance. In this paper, we present a novel approach for FGVC that can simultaneously exploit the partial yet sufficient discriminative information in environmental cues and compress the redundant information in the class-token with respect to the target. Specifically, our model calculates the ratio of high-weight regions in a batch, adaptively adjusts the masking threshold, and achieves moderate extraction of background information in the input space. Moreover, we use the Information Bottleneck~(IB) principle to guide our network to learn minimal sufficient representations in the feature space. Experimental results on three widely used benchmark datasets verify that our approach outperforms state-of-the-art approaches and baseline models.
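To make the adaptive masking step concrete, the following is a minimal sketch of one way such a batch-adaptive threshold could be computed; the function name `adaptive_background_mask`, the use of class-token attention as the per-patch weight, and the quantile-based threshold are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def adaptive_background_mask(attn: torch.Tensor, keep_ratio: float = 0.5):
    """Choose a masking threshold adaptively from batch statistics.

    attn: (B, N) per-patch weights, e.g. class-token attention averaged
          over heads, where higher values mark more discriminative patches.
    keep_ratio: fraction of patches across the batch treated as
          high-weight (foreground); the rest are candidate background.
    """
    # Threshold = batch-level quantile, so the ratio of high-weight
    # regions in the batch stays near keep_ratio regardless of the
    # absolute scale of the attention weights.
    threshold = torch.quantile(attn.flatten(), 1.0 - keep_ratio)
    keep = attn >= threshold   # True: patch kept as discriminative
    mask = ~keep               # True: background patch, masked moderately
    return mask, threshold

# Usage: for ViT-B/16 on 224x224 inputs there are 14*14 = 196 patches.
attn = torch.rand(8, 196)      # dummy attention weights
mask, thr = adaptive_background_mask(attn, keep_ratio=0.6)
```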
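For reference, learning a minimal sufficient representation in the feature space follows the standard IB formulation below; the trade-off coefficient $\beta$ and the notation are the textbook form of the objective, not necessarily the paper's exact loss.

```latex
% Information Bottleneck: compress X into Z while staying predictive of Y
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y)
```

Here $X$ is the input image, $Y$ the subcategory label, and $Z$ the learned representation (the class-token); minimizing $I(X;Z)$ compresses redundant information, while the $I(Z;Y)$ term keeps the representation sufficient for the target.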