In this work, we address the problem of cross-view geo-localization, which estimates the geospatial location of a street-view image by matching it against a database of geo-tagged aerial images. The cross-view matching task is extremely challenging because of the drastic appearance and geometry differences between views. Unlike existing methods, which predominantly rely on CNNs, we devise a novel evolving geo-localization Transformer (EgoTR) that uses the self-attention of the Transformer to model global dependencies, thereby significantly reducing visual ambiguities in cross-view geo-localization. We also exploit the Transformer's positional encoding to help EgoTR understand geometric configurations and establish correspondences between ground and aerial images. Compared with state-of-the-art methods that impose strong assumptions about geometric knowledge, EgoTR flexibly learns the positional embeddings through the training objective and is hence more practical in many real-world scenarios. Although the Transformer is well suited to our task, its vanilla self-attention mechanism lets image patches interact only within each layer independently, which overlooks correlations between layers. Instead, this paper proposes a simple yet effective self-cross attention mechanism to improve the quality of the learned representations. The self-cross attention models global dependencies between adjacent layers, relating image patches to one another while modeling how features evolve from the previous layer. As a result, the proposed self-cross attention leads to more stable training, improves generalization ability, and encourages representations to keep evolving as the network goes deeper. Extensive experiments demonstrate that EgoTR performs favorably against state-of-the-art methods on standard, fine-grained, and cross-dataset cross-view geo-localization tasks.
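The difference between vanilla self-attention and attention across adjacent layers can be sketched as follows. This is a minimal single-head NumPy illustration, not the authors' implementation. It assumes that queries are computed from the current layer's patch features and that keys and values are computed from the previous layer's features; the abstract does not spell out this assignment. The names `attention`, `self_attention`, `self_cross_attention`, `x_prev`, `x_curr`, and the weight matrices are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q_src, kv_src, w_q, w_k, w_v):
    """Single-head scaled dot-product attention.

    q_src:  (n_patches, d) features used to form the queries.
    kv_src: (n_patches, d) features used to form the keys and values.
    """
    q = q_src @ w_q
    k = kv_src @ w_k
    v = kv_src @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (n_patches, n_patches)
    return softmax(scores, axis=-1) @ v


def self_attention(x_curr, w_q, w_k, w_v):
    """Vanilla self-attention: patches interact only within one layer."""
    return attention(x_curr, x_curr, w_q, w_k, w_v)


def self_cross_attention(x_curr, x_prev, w_q, w_k, w_v):
    """Attention across adjacent layers.

    Assumption (not stated in the abstract): queries come from the
    current layer, keys and values from the previous layer.
    """
    return attention(x_curr, x_prev, w_q, w_k, w_v)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_patches, d = 6, 8
    x_prev = rng.normal(size=(n_patches, d))  # previous-layer patch features
    x_curr = rng.normal(size=(n_patches, d))  # current-layer patch features
    w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
    out = self_cross_attention(x_curr, x_prev, w_q, w_k, w_v)
    print(out.shape)
```

Under this assumption, passing the same features as both `x_curr` and `x_prev` reduces the sketch to vanilla self-attention, which makes the two variants easy to compare in an ablation.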