Deep convolutional neural networks (CNNs) have shown a strong ability to mine discriminative object pose and part information for image recognition. For fine-grained recognition, a context-aware, rich feature representation of the object/scene plays a key role, since such objects exhibit significant variance within the same subcategory and only subtle variance across different subcategories. Finding the subtle variance that fully characterizes the object/scene is not straightforward. To address this, we propose a novel context-aware attentional pooling (CAP) that effectively captures subtle changes via sub-pixel gradients, and learns to attend to informative integral regions and their importance in discriminating different subcategories, without requiring bounding-box and/or distinguishable part annotations. We also introduce a novel feature encoding that considers the intrinsic consistency between the informativeness of the integral regions and their spatial structures to capture the semantic correlation among them. Our approach is simple yet extremely effective, and can easily be applied on top of a standard classification backbone network. We evaluate our approach using six state-of-the-art (SotA) backbone networks and eight benchmark datasets. Our method significantly outperforms the SotA approaches on six datasets and is very competitive on the remaining two.
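To make the pooling idea concrete, the sketch below shows a simplified form of attention-weighted pooling over candidate regions applied on top of a standard backbone feature map. This is a minimal illustration under our own assumptions, not the authors' CAP implementation: the module name AttentionalRegionPooling, the per-region 1x1 attention convolution, and all parameter choices are hypothetical stand-ins, and the final averaging step replaces CAP's learned region importance and semantic-correlation encoding.

```python
# A minimal, illustrative sketch (not the authors' code) of attention-weighted
# pooling over candidate regions on top of a backbone feature map.
import torch
import torch.nn as nn


class AttentionalRegionPooling(nn.Module):
    """Pools backbone features over R candidate regions, weighting each
    spatial location by a learned attention map (a simplified stand-in
    for context-aware attentional pooling)."""

    def __init__(self, in_channels: int, num_regions: int, num_classes: int):
        super().__init__()
        # One learned 1x1 convolution per region produces a spatial
        # attention map over the backbone feature grid.
        self.attn = nn.Conv2d(in_channels, num_regions, kernel_size=1)
        self.classifier = nn.Linear(in_channels, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) from any classification backbone.
        b, c, h, w = feats.shape
        # (B, R, H*W): per-region attention, normalized over locations.
        a = self.attn(feats).flatten(2).softmax(dim=-1)
        v = feats.flatten(2)  # (B, C, H*W)
        # Attention-weighted pooling: one C-dim descriptor per region.
        regions = torch.einsum("brn,bcn->brc", a, v)  # (B, R, C)
        # Average the region descriptors; CAP instead learns each region's
        # importance and encodes the semantic correlation among regions.
        pooled = regions.mean(dim=1)  # (B, C)
        return self.classifier(pooled)


# Usage with a standard backbone's feature map, e.g. (B, 2048, 7, 7):
# feats = backbone(images)
# logits = AttentionalRegionPooling(2048, num_regions=8, num_classes=200)(feats)
```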