Visual classification can be divided into coarse-grained and fine-grained classification. Coarse-grained classification covers categories with a large degree of dissimilarity, such as distinguishing cats from dogs, while fine-grained classification covers categories with a large degree of similarity, such as cat breeds, bird species, or the makes and models of vehicles. Unlike coarse-grained visual classification, fine-grained visual classification often requires professional experts to label data, which makes data more expensive. To meet this challenge, many approaches propose to automatically find the most discriminative regions and use local features to provide more precise representations. These approaches require only image-level annotations, thereby reducing the cost of annotation. However, most of these methods rely on two- or multi-stage architectures and cannot be trained end-to-end. Therefore, we propose a novel plug-in module that can be integrated into many common backbones, including CNN-based and Transformer-based networks, to provide strongly discriminative regions. The plug-in module outputs pixel-level feature maps and fuses the filtered features to enhance fine-grained visual classification. Experimental results show that the proposed plug-in module outperforms state-of-the-art approaches and significantly improves accuracy to 92.77\% and 92.83\% on CUB200-2011 and NABirds, respectively. We have released our source code on GitHub at https://github.com/chou141253/FGVC-PIM.git.
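To make the idea concrete, the following is a minimal NumPy sketch of the general mechanism the abstract describes: scoring every spatial location of a backbone feature map, keeping the most discriminative locations, and fusing their features for the final prediction. This is an illustrative toy, not the paper's actual implementation; the function name `plugin_select_and_fuse`, the use of max class score as the discriminativeness measure, and mean-pooling as the fusion step are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def plugin_select_and_fuse(feature_map, weight, k=4):
    """Toy sketch (not the paper's method): score each spatial location
    with a linear head, keep the top-k most confident locations as
    "discriminative regions", and fuse their features by averaging.

    feature_map: (H, W, C) backbone output
    weight:      (C, num_classes) classification head
    """
    H, W, C = feature_map.shape
    flat = feature_map.reshape(H * W, C)    # pixel-level features
    scores = flat @ weight                  # per-location class scores
    confidence = scores.max(axis=1)         # proxy for discriminativeness
    top_idx = np.argsort(confidence)[-k:]   # indices of top-k locations
    fused = flat[top_idx].mean(axis=0)      # fuse the filtered features
    return fused @ weight                   # final class logits

features = rng.standard_normal((7, 7, 32))  # e.g. a 7x7 feature map, C=32
head = rng.standard_normal((32, 10))        # hypothetical 10-class head
logits = plugin_select_and_fuse(features, head)
print(logits.shape)  # (10,)
```

In the actual module, the selection and fusion are learned end-to-end together with the backbone, which is what removes the need for the multi-stage training pipelines mentioned above.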