Cross-view geo-localisation remains a challenging task in which additional modules, specific pre-processing or zooming strategies are needed to determine the accurate position of an image. Because different views have different geometries, pre-processing such as the polar transformation helps to align them. However, this produces distorted images that then have to be rectified. Adding hard negatives to the training batch could improve overall performance, but the loss functions commonly used in geo-localisation make it difficult to include them. In this article, we present a simplified but effective architecture based on contrastive learning with a symmetric InfoNCE loss that outperforms current state-of-the-art results. Our framework consists of a narrow training pipeline that eliminates the need for aggregation modules, avoids further pre-processing steps and even increases the generalisation capability of the model to unknown regions. We introduce two sampling strategies for hard negatives. The first explicitly exploits geographically neighbouring locations to provide a good starting point. The second leverages the visual similarity between image embeddings to mine hard negative samples. Our work shows excellent performance on common cross-view datasets such as CVUSA, CVACT, University-1652 and VIGOR. A comparison between cross-area and same-area settings demonstrates the good generalisation capability of our model.
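To make the two central ingredients concrete, the following is a minimal sketch of a symmetric InfoNCE loss and of similarity-based hard-negative mining, not the authors' implementation. It assumes that a batch holds paired ground-view and aerial-view embeddings where row `i` of each list belongs to the same location, that similarity is cosine similarity, and that the temperature is a fixed value of 0.1. The function names (`symmetric_info_nce`, `mine_hard_negatives`) and all parameter values are hypothetical and chosen only for illustration. It uses only the Python standard library so that it runs anywhere; a real pipeline would use a tensor library.

```python
import math


def l2_normalise(vector):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def cross_entropy_on_diagonal(logits):
    """Mean of -log softmax(row_i)[i]: the matching pair is the target class."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(ground, aerial, temperature=0.1):
    """Average the ground-to-aerial and aerial-to-ground InfoNCE terms.

    Assumption: ground[i] and aerial[i] show the same location; every other
    aerial[j] in the batch serves as a negative for ground[i], and vice versa.
    """
    g = [l2_normalise(v) for v in ground]
    a = [l2_normalise(v) for v in aerial]
    logits = [[dot(gi, aj) / temperature for aj in a] for gi in g]
    logits_t = [list(column) for column in zip(*logits)]
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits_t))


def mine_hard_negatives(ground, aerial, k=1):
    """For each ground embedding, return the indices of the k most similar
    aerial embeddings that are NOT its true match (visually hard negatives)."""
    g = [l2_normalise(v) for v in ground]
    a = [l2_normalise(v) for v in aerial]
    mined = []
    for i, gi in enumerate(g):
        scored = [(dot(gi, aj), j) for j, aj in enumerate(a) if j != i]
        scored.sort(reverse=True)
        mined.append([j for _, j in scored[:k]])
    return mined


if __name__ == "__main__":
    ground = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
    aerial = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.2]]
    print("loss:", symmetric_info_nce(ground, aerial))
    print("hard negatives:", mine_hard_negatives(ground, aerial, k=1))
```

Because every non-matching item in the batch already acts as a negative in both directions, hard negatives can be included simply by placing them in the same batch, for example the locations returned by `mine_hard_negatives` or, for the geographic strategy, locations that are close on the map. How batches are actually composed in the paper is not specified in this abstract.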