Ground-to-aerial geolocalization localizes a ground-level query image by matching it against a reference database of geo-tagged aerial imagery. The task is very challenging because visual appearance and geometric configuration differ drastically between the two views. In this work, we propose a novel Transformer-guided convolutional neural network (TransGCNN) architecture, which couples CNN-based local features with Transformer-based global representations for enhanced representation learning. Specifically, TransGCNN consists of a CNN backbone that extracts a feature map from an input image and a Transformer head that models global context from the CNN feature map. In particular, the Transformer head acts as a spatial-aware importance generator that selects salient CNN features as the final feature representation. This coupling allows us to use a lightweight Transformer network to greatly enhance the discriminative capability of the embedded features. Furthermore, we design a dual-branch Transformer head network that combines image features from multi-scale windows to improve the detail of the global feature representation. Extensive experiments on popular benchmark datasets demonstrate that our model achieves top-1 accuracy of 94.12% and 84.92% on CVUSA and CVACT_val, respectively, outperforming the second-best baseline with fewer than 50% of its parameters and an almost 2x higher frame rate, and therefore achieving a preferable accuracy-efficiency tradeoff.
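
To make the "importance generator" idea concrete, the following is a minimal sketch and not the TransGCNN implementation. It assumes that the Transformer head yields one scalar score per spatial location of the CNN feature map, that the scores are turned into weights with a softmax, and that the final descriptor is the weighted sum of the CNN features, L2-normalized for retrieval. The function names `importance_pool` and `l2_normalize`, the softmax weighting, and the normalization step are all illustrative assumptions; the abstract does not specify them.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a flat list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def importance_pool(features: List[List[float]], scores: List[float]) -> List[float]:
    """Aggregate per-location CNN features into a single descriptor.

    features: N spatial locations (a flattened H*W map), each a C-dim vector.
    scores:   N raw importance scores, one per location. This is a
              hypothetical stand-in for the output of a Transformer head.
    """
    if len(features) != len(scores):
        raise ValueError("one score per spatial location is required")
    weights = softmax(scores)  # assumption: softmax turns scores into weights
    channels = len(features[0])
    pooled = [0.0] * channels
    for vector, weight in zip(features, weights):
        for c in range(channels):
            pooled[c] += weight * vector[c]
    return pooled


def l2_normalize(vector: List[float]) -> List[float]:
    """Scale a vector to unit length (assumed here for similarity search)."""
    norm = math.sqrt(sum(v * v for v in vector)) or 1.0
    return [v / norm for v in vector]


# Two locations, two channels: equal scores average the two feature vectors.
uniform = importance_pool([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
# A dominant score makes the descriptor follow the salient location.
peaked = importance_pool([[1.0, 0.0], [0.0, 1.0]], [10.0, -10.0])
```

In this sketch the head only has to produce a small score map while the CNN features carry the descriptor content, which is one plausible reading of why a lightweight Transformer can suffice.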