Cross-view geo-localization aims to estimate the GPS location of a query ground-view image by matching it against a reference database of geo-tagged aerial images. To address this challenging problem, recent approaches use panoramic ground-view images to increase the range of visibility. Although appealing, panoramic images are less readily available than videos composed of limited field-of-view (FOV) images. In this paper, we present the first cross-view geo-localization method that works on a sequence of limited-FOV images. Our model is trained end-to-end to capture the temporal structure within the frames using an attention-based temporal feature aggregation module. To robustly handle varying sequence lengths and GPS noise during inference, we propose a sequential dropout scheme that simulates variable-length sequences. To evaluate the proposed approach in realistic settings, we present a new large-scale dataset containing ground-view sequences along with the corresponding aerial-view images. Extensive experiments and comparisons demonstrate the superiority of the proposed approach over several competitive baselines.
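The abstract does not spell out how sequential dropout is implemented, so the following is only a minimal sketch of one plausible reading: during training, a random subset of frames is dropped from each ground-view sequence while temporal order is preserved, so the model sees sequences of varying effective length. The function name `sequential_dropout`, the `keep_min` parameter, and the returned mask are illustrative assumptions, not the paper's actual interface.

```python
import random


def sequential_dropout(frames, keep_min=1, rng=None):
    """Randomly drop frames from a sequence while preserving temporal order.

    Hypothetical helper (not from the paper): returns the surviving frames
    and a 0/1 mask over the original positions.
    """
    if rng is None:
        rng = random.Random()
    n = len(frames)
    if n == 0:
        return [], []
    # Clamp keep_min so that at least one and at most n frames survive.
    keep_min = max(1, min(keep_min, n))
    # Pick how many frames survive, then which positions they occupy.
    k = rng.randint(keep_min, n)
    kept_idx = sorted(rng.sample(range(n), k))
    kept_set = set(kept_idx)
    mask = [1 if i in kept_set else 0 for i in range(n)]
    kept = [frames[i] for i in kept_idx]
    return kept, mask
```

Under these assumptions, the mask could be passed to a temporal aggregation module so that dropped positions are ignored, for example by excluding them from the attention weights.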