How do neural networks distinguish two images? Understanding the matching mechanism of deep models is critical to developing reliable intelligent systems for high-risk visual applications such as surveillance and access control. However, most existing deep metric learning methods match images by comparing feature vectors, which ignores the spatial structure of the images and thus lacks interpretability. In this paper, we present a deep interpretable metric learning (DIML) method for more transparent embedding learning. Unlike conventional metric learning methods based on feature vector comparison, we propose a structural matching strategy that explicitly aligns the spatial embeddings by computing an optimal matching flow between the feature maps of the two images. Our method enables deep models to learn metrics in a more human-friendly way, where the similarity of two images can be decomposed into several part-wise similarities and their contributions to the overall similarity. Our method is model-agnostic and can be applied to off-the-shelf backbone networks and metric learning methods. We evaluate our method on three major benchmarks of deep metric learning, CUB200-2011, Cars196, and Stanford Online Products, and achieve substantial improvements over popular metric learning methods with better interpretability. Code is available at https://github.com/wl-zhao/DIML
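To make the idea of structural matching more concrete, the following is a minimal sketch of how a similarity score could be decomposed into part-wise contributions through a matching flow between two feature maps. It is not the authors' implementation. The function names, the uniform marginals, the cosine-based cost, and the use of entropy-regularized Sinkhorn iterations to approximate the optimal flow are all assumptions made for illustration; the abstract does not specify these details.

```python
import numpy as np


def sinkhorn(cost, mu, nu, eps=0.05, n_iters=200):
    """Approximate an optimal transport plan with Sinkhorn iterations.

    Assumption: entropy-regularized OT stands in for the 'optimal
    matching flow'; the abstract does not name a solver.
    """
    kernel = np.exp(-cost / eps)          # Gibbs kernel of the cost matrix
    u = np.ones_like(mu)
    for _ in range(n_iters):
        v = nu / (kernel.T @ u)           # rescale to match column marginals
        u = mu / (kernel @ v)             # rescale to match row marginals
    return u[:, None] * kernel * v[None, :]


def structural_similarity(fmap_a, fmap_b, eps=0.05, n_iters=200):
    """Hypothetical helper: similarity of two (H, W, C) feature maps.

    Returns the overall score, the matching flow, and the per-pair
    contributions whose sum equals the overall score.
    """
    a = fmap_a.reshape(-1, fmap_a.shape[-1])
    b = fmap_b.reshape(-1, fmap_b.shape[-1])
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)

    sim = a @ b.T                         # cosine similarity between locations
    cost = 1.0 - sim                      # matching cost in [0, 2]

    # Assumption: every spatial location carries equal weight.
    mu = np.full(a.shape[0], 1.0 / a.shape[0])
    nu = np.full(b.shape[0], 1.0 / b.shape[0])

    flow = sinkhorn(cost, mu, nu, eps=eps, n_iters=n_iters)
    contributions = flow * sim            # part-wise share of the total score
    return contributions.sum(), flow, contributions
```

In this sketch, `flow[i, j]` indicates how strongly location `i` of the first image is matched to location `j` of the second, and `contributions[i, j]` is that pair's share of the overall similarity, which is what makes the score inspectable part by part.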