Deeply learned representations have achieved superior image retrieval performance in a retrieve-then-rerank manner. A recent state-of-the-art single-stage model, which heuristically fuses local and global features, achieves a promising trade-off between efficiency and effectiveness. However, we observe that the efficiency of existing solutions is still restricted by their multi-scale inference paradigm. In this paper, we follow the single-stage approach and achieve a further complexity-effectiveness balance by dispensing with multi-scale testing altogether. To this end, we abandon the widely used convolutional network, given its limitations in exploring diverse visual patterns, and resort to a fully attention-based framework for robust representation learning, motivated by the success of the Transformer. Besides applying the Transformer for global feature extraction, we devise a local branch composed of window-based multi-head attention and spatial attention to fully exploit local image patterns. Furthermore, we propose to combine the hierarchical local and global features via a cross-attention module, instead of the heuristic fusion used in prior work. Extensive experiments show that our Deep Attentive Local and Global modeling framework (DALG) significantly improves efficiency while maintaining results competitive with the state of the art.
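To make the cross-attention fusion concrete, the sketch below shows one plausible arrangement in PyTorch: a single global descriptor token queries a sequence of local tokens produced by the local branch. The dimensions, head count, residual connection, and the choice of global-as-query are illustrative assumptions; the abstract does not fix these details.

```python
# A minimal sketch of cross-attention fusion of local and global features,
# assuming PyTorch. Shapes and the query/key-value roles are assumptions,
# not the paper's confirmed design.
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Fuses a global descriptor with a sequence of local tokens.

    The global feature serves as the query; local tokens provide the keys
    and values (an assumed arrangement for illustration only).
    """

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, global_feat: torch.Tensor, local_tokens: torch.Tensor) -> torch.Tensor:
        # global_feat:  (B, 1, D) one global token per image
        # local_tokens: (B, N, D) tokens from the local branch
        fused, _ = self.attn(query=global_feat, key=local_tokens, value=local_tokens)
        # Residual connection plus normalization, then drop the token axis
        return self.norm(fused + global_feat).squeeze(1)  # (B, D) descriptor


if __name__ == "__main__":
    fusion = CrossAttentionFusion(dim=256, num_heads=8)
    g = torch.randn(4, 1, 256)    # hypothetical global features
    l = torch.randn(4, 49, 256)   # e.g., 7x7 grid of local tokens
    print(fusion(g, l).shape)     # torch.Size([4, 256])
```

Querying the local tokens with the global descriptor lets the model learn which local patterns to emphasize, in contrast to the fixed heuristic fusion the abstract contrasts against.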