Co-visual pattern augmented generative transformer learning for automobile geo-localization

Geolocation is a fundamental component of route planning and navigation for unmanned vehicles, but GNSS-based geolocation fails under denial-of-service conditions. Cross-view geo-localization (CVGL), which aims to estimate the geographical location of a ground-level camera by matching its image against a large collection of geo-tagged aerial (e.g., satellite) images, has received considerable attention but remains extremely challenging because of the drastic appearance differences between aerial and ground views. Existing methods primarily extract global representations of the two views with Siamese-like architectures, but seldom exploit the benefits of interaction between the views. In this paper, we present a novel approach for CVGL that combines cross-view knowledge generation with transformers, namely mutual generative transformer learning (MGTL). Specifically, starting from the initial representations produced by the backbone network, MGTL develops two separate generative sub-modules, one generating aerial-aware knowledge from ground-view semantics and the other generating ground-aware knowledge from aerial-view semantics, and exploits their mutual benefits through the attention mechanism. Moreover, to better capture the co-visual relationships between aerial and ground views, we introduce a cascaded attention masking algorithm to further boost accuracy. Extensive experiments on challenging public benchmarks, i.e., CVACT and CVUSA, demonstrate the effectiveness of the proposed method, which sets new records compared with existing state-of-the-art models.
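To make the idea of cross-view attention with masking more concrete, the following is a minimal sketch in plain Python, not the MGTL implementation. It assumes that each view is already encoded as a short list of feature vectors (tokens), that one view's tokens act as queries over the other view's tokens, and that a second attention pass is restricted to the keys that scored highest in the first pass. The function names `cross_attention` and `top_k_mask`, the top-k rule, and the toy vectors are all hypothetical illustrations; the abstract does not specify how the cascaded attention masking is actually computed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values, mask=None):
    """Scaled dot-product attention of one view's tokens over the other view's.

    mask[i][j] == False hides key j from query i (hypothetical masking rule).
    Returns (outputs, attention_weights).
    """
    d = len(keys[0])
    outputs, weights = [], []
    for i, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        if mask is not None:
            # A very negative score drives the softmax weight to zero.
            scores = [s if keep else -1e9 for s, keep in zip(scores, mask[i])]
        w = softmax(scores)
        # Weighted sum of the value vectors, one output vector per query.
        out = [sum(wj * v[c] for wj, v in zip(w, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


def top_k_mask(weights, k):
    """Assumed masking rule: keep only the k most-attended keys per query."""
    mask = []
    for row in weights:
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        keep = set(ranked[:k])
        mask.append([j in keep for j in range(len(row))])
    return mask


# Toy tokens (made up): 2 ground-view tokens, 3 aerial-view tokens, dimension 2.
ground = [[1.0, 0.0], [0.0, 1.0]]
aerial = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

# First pass: ground tokens attend over all aerial tokens.
_, first_weights = cross_attention(ground, aerial, aerial)
# Second pass: attention restricted to the strongest key from the first pass.
mask = top_k_mask(first_weights, k=1)
refined, second_weights = cross_attention(ground, aerial, aerial, mask=mask)
```

Swapping the roles of `ground` and `aerial` gives the opposite direction of the exchange. A real model would use learned query, key, and value projections and multiple heads, which are omitted here for brevity.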
