Fine-grained visual categorization (FGVC) aims to recognize objects from visually similar subordinate categories, a task that is both challenging and of practical value for accurate automatic recognition. Most FGVC approaches study attention mechanisms for mining discriminative regions but neglect the interdependencies among those regions and the holistic object structure they compose, both of which are essential for a model's ability to localize and understand discriminative information. To address these limitations, we propose the Structure Information Modeling Transformer (SIM-Trans), which incorporates object structure information into the transformer so that the learned discriminative representation contains both appearance and structure information. Specifically, we encode the image into a sequence of patch tokens and build a strong vision transformer framework with two well-designed modules: (i) the structure information learning (SIL) module mines the spatial context relations of significant patches within the object extent using the transformer's self-attention weights, and these relations are injected back into the model to introduce structure information; (ii) the multi-level feature boosting (MFB) module exploits the complementarity of multi-level features and contrastive learning among classes to enhance feature robustness for accurate recognition. The two proposed modules are lightweight, can be plugged into any transformer network, and are easily trained end-to-end, since they depend only on the attention weights that the vision transformer itself provides. Extensive experiments and analyses demonstrate that the proposed SIM-Trans achieves state-of-the-art performance on fine-grained visual categorization benchmarks. The code is available at https://github.com/PKU-ICST-MIPL/SIM-Trans_ACMMM2022.
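To make the idea behind the SIL module more concrete, the following is a minimal sketch and not the released implementation. It assumes that "significant patches" are those whose class-token attention exceeds the mean, and that the "spatial context relation" can be expressed as the normalized distance and angle of each significant patch relative to the most attended one. The abstract does not state either choice, and the function name `spatial_context` and its parameters are hypothetical.

```python
import numpy as np

def spatial_context(cls_attention, grid_size):
    """Pick significant patches from attention and describe their layout.

    cls_attention: 1-D array of length grid_size**2 holding the attention
        weight from the class token to each patch (assumed to be already
        averaged over heads; this is an assumption of the sketch).
    grid_size: number of patches per image side.

    Returns the indices of the significant patches, plus their normalized
    distance and angle relative to the most attended patch.
    """
    cls_attention = np.asarray(cls_attention, dtype=float)
    if cls_attention.shape != (grid_size * grid_size,):
        raise ValueError("expected one attention weight per patch")

    # Assumed selection rule: keep patches attended more than average.
    idx = np.flatnonzero(cls_attention > cls_attention.mean())

    # Convert flat patch indices to (row, col) grid coordinates.
    rows, cols = np.divmod(idx, grid_size)
    coords = np.stack([rows, cols], axis=1).astype(float)

    # Use the most attended significant patch as the reference point.
    anchor = coords[np.argmax(cls_attention[idx])]
    offsets = (coords - anchor) / grid_size

    # Relative position of every significant patch w.r.t. the anchor.
    distance = np.linalg.norm(offsets, axis=1)
    angle = np.arctan2(offsets[:, 0], offsets[:, 1])
    return idx, distance, angle


# Toy 3x3 grid: the centre patch and two neighbours receive attention.
attention = [0.0, 0.0, 0.0,
             0.0, 0.5, 0.3,
             0.0, 0.2, 0.0]
idx, distance, angle = spatial_context(attention, grid_size=3)
```

In a full model, such relative-position descriptors would still need to be turned into features and fused with the patch tokens. The abstract only says that the structure information is "injected into the model", so that step is left out here.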