Image retrieval is the fundamental task of obtaining images similar to a query image from a database. A common practice is to first retrieve candidate images via similarity search over global image features and then re-rank the candidates using their local features. Previous learning-based studies mainly focus on either global or local image representation learning to tackle the retrieval task. In this paper, we abandon the two-stage paradigm and seek to design an effective single-stage solution by integrating local and global information within images into compact image representations. Specifically, we propose a Deep Orthogonal Local and Global (DOLG) information fusion framework for end-to-end image retrieval. It first attentively extracts representative local information with multi-atrous convolutions and self-attention. Components orthogonal to the global image representation are then extracted from the local information. Finally, the orthogonal components are concatenated with the global representation as a complement, and aggregation is performed to generate the final representation. The whole framework is end-to-end differentiable and can be trained with image-level labels. Extensive experimental results validate the effectiveness of our solution and show that our model achieves state-of-the-art image retrieval performance on the Revisited Oxford and Paris datasets.
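The core fusion step described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation (DOLG uses learned layers, e.g. GeM pooling and a fully connected projection); it only demonstrates the orthogonal decomposition of a local feature map against a global descriptor, the channel-wise concatenation, and a simple spatial aggregation. The function name and the use of plain average pooling are assumptions for illustration.

```python
import numpy as np

def orthogonal_fusion(f_local, f_global):
    """Sketch of DOLG-style orthogonal fusion (illustrative, not the paper's code).

    f_local  : (C, H, W) local feature map
    f_global : (C,)      global image descriptor
    Returns a (2C,) fused descriptor.
    """
    C, H, W = f_local.shape
    # Scalar projection coefficient of each local feature onto the global one: (H, W).
    dot = np.tensordot(f_global, f_local, axes=([0], [0]))
    coeff = dot / np.dot(f_global, f_global)
    # Projection of each local feature onto f_global, then its orthogonal residue.
    proj = coeff[None, :, :] * f_global[:, None, None]      # (C, H, W)
    f_orth = f_local - proj                                  # component ⊥ f_global
    # Broadcast the global descriptor spatially and concatenate along channels.
    g_map = np.broadcast_to(f_global[:, None, None], (C, H, W))
    fused = np.concatenate([f_orth, g_map], axis=0)          # (2C, H, W)
    # Simple average pooling over space stands in for the paper's aggregation.
    return fused.reshape(2 * C, -1).mean(axis=1)             # (2C,)
```

By linearity, the spatial average of the orthogonal residues is itself orthogonal to the global descriptor, so the fused vector carries local information that is complementary (not redundant) to the global part.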