This paper investigates representation learning for large-scale visual place recognition, the task of determining the location depicted in a query image by referring to a database of reference images. The task is challenging because of the substantial environmental changes that can occur over time (e.g., weather, illumination, season, traffic, occlusion). Progress is currently hindered by the lack of large databases with accurate ground truth. To address this, we introduce GSV-Cities, a new image dataset providing the widest geographic coverage to date with highly accurate ground truth, covering more than 40 cities across all continents over a 14-year period. We then explore the full potential of recent advances in deep metric learning to train networks specifically for place recognition, and evaluate how different loss functions influence performance. In addition, we show that the performance of existing methods improves substantially when they are trained on GSV-Cities. Finally, we introduce a new fully convolutional aggregation layer that outperforms existing techniques, including GeM, NetVLAD and CosPlace, and establish a new state of the art on large-scale benchmarks such as Pittsburgh, Mapillary-SLS, SPED and Nordland. The dataset and code are available for research purposes at https://github.com/amaralibey/gsv-cities.
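To make the retrieval formulation concrete, the following is a minimal sketch of how a query descriptor might be matched against a database of reference descriptors and scored with a recall-at-K measure. It is not the code from the GSV-Cities repository: the function names (`l2_normalize`, `retrieve_top_k`, `recall_at_k`), the use of cosine similarity, and the toy two-dimensional descriptors are all assumptions made for illustration, and the abstract does not specify the evaluation protocol.

```python
import math


def l2_normalize(vec):
    """Scale a descriptor to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def retrieve_top_k(query, references, k=1):
    """Return the indices of the k references most similar to the query.

    Similarity is the cosine similarity between descriptors (an assumption;
    any other descriptor distance could be substituted).
    """
    q = l2_normalize(query)
    scored = []
    for idx, ref in enumerate(references):
        r = l2_normalize(ref)
        similarity = sum(a * b for a, b in zip(q, r))
        scored.append((similarity, idx))
    # Highest similarity first; ties are broken by the lower index.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [idx for _, idx in scored[:k]]


def recall_at_k(queries, references, positives, k=1):
    """Fraction of queries with at least one true match among the top k results.

    positives[i] is the set of reference indices that depict the same place
    as queries[i] (hypothetical ground truth for this sketch).
    """
    hits = 0
    for query, true_matches in zip(queries, positives):
        if set(retrieve_top_k(query, references, k)) & set(true_matches):
            hits += 1
    return hits / len(queries)


# Toy data: three reference descriptors and two query descriptors.
references = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.1, 0.8]]
positives = [{0}, {2}]

recall_1 = recall_at_k(queries, references, positives, k=1)
recall_2 = recall_at_k(queries, references, positives, k=2)
```

In practice the descriptors would come from a trained network followed by an aggregation layer, and the exhaustive loop would typically be replaced by a batched matrix product or an approximate nearest-neighbour index.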