The core of fine-grained visual categorization (FGVC) is to learn subtle yet discriminative features. Most previous works achieve this by explicitly selecting discriminative parts or integrating attention mechanisms into CNN-based approaches. However, these methods increase computational complexity and cause the model to be dominated by the regions containing most of the object. Recently, the vision transformer (ViT) has achieved SOTA performance on general image recognition tasks. Its self-attention mechanism aggregates and weights the information from all patches into the classification token, making it naturally suitable for FGVC. Nonetheless, the classification token in the deep layers attends mostly to global information and lacks the local and low-level features that are essential for FGVC. In this work, we propose a novel pure transformer-based framework, Feature Fusion Vision Transformer (FFVT), which aggregates the important tokens from each transformer layer to compensate for local, low-level, and middle-level information. We design a novel token selection module called mutual attention weight selection (MAWS) that guides the network effectively and efficiently towards selecting discriminative tokens without introducing extra parameters. We verify the effectiveness of FFVT on three benchmarks, where FFVT achieves state-of-the-art performance.
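The idea of selecting tokens by mutual attention with the classification token can be sketched as follows. This is an illustrative interpretation, not the paper's exact formulation: the scoring rule (product of CLS-to-token and token-to-CLS attention), the single-head attention matrix, and the top-k parameter are all assumptions made for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def maws_select(attn_logits, k):
    """Mutual-attention-style token selection (sketch).

    attn_logits: (n+1, n+1) raw attention scores for one head,
                 where index 0 is the classification (CLS) token.
    Returns indices of the k patch tokens whose mutual attention
    with the CLS token is strongest.
    """
    # attention of CLS to every patch token (row 0, softmax over keys)
    cls_to_tok = softmax(attn_logits, axis=-1)[0, 1:]
    # attention of every patch token to CLS (column 0, softmax over queries)
    tok_to_cls = softmax(attn_logits, axis=0)[1:, 0]
    # mutual attention score: strong in both directions
    score = cls_to_tok * tok_to_cls
    # top-k patch-token indices (offset by 1 to skip the CLS slot)
    return np.argsort(score)[::-1][:k] + 1
```

Because the scores are derived from attention weights the network already computes, a selection rule of this kind adds no learnable parameters, consistent with the claim in the abstract.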