Fine-grained visual classification (FGVC) is becoming an important research field, owing to its wide range of applications and the rapid development of computer vision technologies. Current state-of-the-art (SOTA) methods in FGVC usually employ attention mechanisms to first capture semantic parts and then discover the subtle differences between those parts across classes. Channel-spatial attention mechanisms, which focus on discriminative channels and regions simultaneously, have significantly improved classification performance. However, existing attention modules are poorly guided, because part-based detectors in FGVC rely on the network's learning ability alone, without supervision from part annotations. Since obtaining such part annotations is labor-intensive, visual localization and explanation methods such as gradient-weighted class activation mapping (Grad-CAM) can instead be used to supervise the attention mechanism. We propose a Grad-CAM guided channel-spatial attention module for FGVC, which uses the coarse localization maps generated by Grad-CAM to supervise and constrain the attention weights. To demonstrate the effectiveness of the proposed method, we conduct comprehensive experiments on three popular FGVC datasets: CUB-200-2011, Stanford Cars, and FGVC-Aircraft. The proposed method outperforms SOTA attention modules on the FGVC task. In addition, visualizations of feature maps demonstrate the superiority of the proposed method over SOTA approaches.
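As a rough illustration of the idea described above, the following NumPy sketch computes standard Grad-CAM channel weights and a coarse localization map from a feature tensor and its gradients, then measures how far a set of attention weights deviates from them. The abstract does not specify the form of the constraint, so the symmetric KL divergence used here, the min-shift normalization, and the function names `normalize`, `symmetric_kl`, and `guidance_loss` are assumptions made for illustration only, not the authors' implementation.

```python
import numpy as np

def grad_cam(features, grads):
    """Standard Grad-CAM on one feature tensor.

    features, grads: arrays of shape (C, H, W); grads holds the gradient of
    the class score with respect to the features.
    Returns per-channel weights (C,) and a coarse localization map (H, W).
    """
    # Channel importance: global average pooling of the gradients.
    channel_weights = grads.mean(axis=(1, 2))
    # Weighted sum of feature maps, followed by ReLU.
    cam = np.maximum((channel_weights[:, None, None] * features).sum(axis=0), 0.0)
    return channel_weights, cam

def normalize(x, eps=1e-8):
    """Shift to non-negative values and scale to sum to 1 (assumed scheme)."""
    x = x - x.min() + eps
    return x / x.sum()

def symmetric_kl(p, q):
    """Symmetric KL divergence between two discrete distributions."""
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def guidance_loss(channel_attn, spatial_attn, features, grads):
    """Hypothetical penalty: distance of attention weights from Grad-CAM.

    channel_attn: (C,) channel attention weights.
    spatial_attn: (H, W) spatial attention weights.
    """
    weights, cam = grad_cam(features, grads)
    channel_term = symmetric_kl(normalize(channel_attn), normalize(weights))
    spatial_term = symmetric_kl(normalize(spatial_attn.ravel()),
                                normalize(cam.ravel()))
    return channel_term + spatial_term
```

In a training loop, a term of this kind would typically be added to the classification loss with a weighting factor, so that the attention module is pulled toward the regions and channels that Grad-CAM marks as relevant to the predicted class.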