Cross-view geo-localization aims to estimate the location of a query ground image by matching it against a database of geo-tagged aerial reference images. The task is extremely challenging because of the drastic viewpoint change and the different capture times between the two views. Despite these difficulties, recent works have achieved outstanding progress on cross-view geo-localization benchmarks. However, existing methods still perform poorly on cross-area benchmarks, in which the training and testing data are captured in two different regions. We attribute this deficiency to the models' inability to extract the spatial configuration of visual feature layouts and to their overfitting to low-level details in the training set. In this paper, we propose GeoDTR, which explicitly disentangles geometric information from raw features and learns the spatial correlations among visual features from aerial and ground pairs with a novel geometric layout extractor module. This module generates a set of geometric layout descriptors that modulate the raw features and produce high-quality latent representations. In addition, we elaborate on two categories of data augmentation: (i) layout simulation, which varies the spatial configuration while keeping the low-level details intact, and (ii) semantic augmentation, which alters the low-level details and encourages the model to capture spatial configurations. These augmentations improve the performance of cross-view geo-localization models, especially on cross-area benchmarks. Moreover, we propose a counterfactual-based learning process that helps the geometric layout extractor explore spatial information. Extensive experiments show that GeoDTR not only achieves state-of-the-art results but also significantly boosts performance on both same-area and cross-area benchmarks.
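The two augmentation categories can be illustrated with a minimal NumPy sketch. The abstract does not state which concrete operations are used, so everything below is an assumption made for illustration: the function names `layout_simulation` and `semantic_augmentation` are hypothetical, layout simulation is approximated by rotating the aerial image in 90-degree steps while circularly shifting the ground panorama by the matching fraction of its width, and semantic augmentation is approximated by a global brightness and contrast change. Images are assumed to be `(H, W, C)` float arrays in `[0, 1]`, and the shift direction depends on the alignment convention of the dataset.

```python
import numpy as np


def layout_simulation(aerial, ground, k):
    """Change the spatial layout of a pair without touching pixel values.

    Assumption: the aerial image is rotated by k * 90 degrees and the ground
    panorama is circularly shifted by k quarters of its width so that the
    pair stays geometrically consistent. The sign of the shift is a
    convention and may need to be flipped for a given dataset.
    """
    rotated = np.rot90(aerial, k=k, axes=(0, 1))
    shift = (k % 4) * ground.shape[1] // 4
    shifted = np.roll(ground, shift, axis=1)
    return rotated, shifted


def semantic_augmentation(image, rng, max_brightness=0.2, max_contrast=0.2):
    """Change low-level appearance while leaving the spatial layout intact.

    Assumption: a random global brightness offset and contrast scale stand in
    for the semantic augmentations mentioned in the abstract.
    """
    img = image.astype(np.float32)
    brightness = rng.uniform(-max_brightness, max_brightness)
    contrast = 1.0 + rng.uniform(-max_contrast, max_contrast)
    mean = img.mean()
    out = (img - mean) * contrast + mean + brightness
    return np.clip(out, 0.0, 1.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    aerial = rng.random((256, 256, 3), dtype=np.float32)
    ground = rng.random((128, 512, 3), dtype=np.float32)

    aerial_l, ground_l = layout_simulation(aerial, ground, k=1)
    aerial_s = semantic_augmentation(aerial, rng)
    print(aerial_l.shape, ground_l.shape, aerial_s.shape)
```

In this sketch, layout simulation only permutes pixels, so the low-level statistics of each image are unchanged, whereas semantic augmentation changes pixel values but leaves every pixel in place. That separation mirrors the distinction the abstract draws between the two categories.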