Small inter-class and large intra-class variations are the main challenges in fine-grained visual classification: objects from different classes share visually similar structures, while objects within the same class can appear in different poses and viewpoints. Properly extracting discriminative local features (e.g., a bird's beak or a car's headlight) is therefore crucial. Most recent successes on this problem build on attention models that localize and attend to discriminative local object parts. In this work, we propose Coarse2Fine, a training method for visual attention networks that creates a differentiable path from the input space to the attended feature maps. Coarse2Fine learns an inverse mapping function from the attended feature maps back to the informative regions of the raw image, which guides the attention maps to better attend to fine-grained features. We show that Coarse2Fine, combined with orthogonal initialization of the attention weights, surpasses state-of-the-art accuracies on common fine-grained classification tasks.
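The abstract names orthogonal initialization of the attention weights as one ingredient. A minimal sketch of such an initializer is given below in NumPy, using the QR decomposition of a random Gaussian matrix; the shapes are hypothetical, and in practice a framework built-in such as `torch.nn.init.orthogonal_` would be used on the attention layers:

```python
import numpy as np

def orthogonal_init(rows, cols, seed=0):
    """Return a (rows, cols) weight matrix whose rows (or columns,
    whichever dimension is smaller) are orthonormal, obtained from the
    QR decomposition of a random Gaussian matrix."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    # Fix the sign ambiguity of QR so the result is uniformly distributed.
    q = q * np.sign(np.diag(r))
    return q.T if rows < cols else q  # final shape: (rows, cols)

# Hypothetical example: 4 attention maps over 8-dimensional features.
W = orthogonal_init(4, 8)
print(np.allclose(W @ W.T, np.eye(4)))  # rows are orthonormal
```

Orthogonal rows make the initial attention maps mutually decorrelated, which encourages different maps to attend to different object parts rather than collapsing onto the same region.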