We introduce AmsterTime: a challenging dataset to benchmark visual place recognition (VPR) in the presence of a severe domain shift. AmsterTime offers a collection of 2,500 well-curated image pairs that match street-view images to historical archival images of the same scenes in the city of Amsterdam. The image pairs capture the same place with different cameras, viewpoints, and appearances. Unlike existing benchmark datasets, AmsterTime is directly crowdsourced in a GIS navigation platform (Mapillary). We evaluate various baselines, including non-learning, supervised, and self-supervised methods pre-trained on different relevant datasets, for both verification and retrieval tasks. Our results credit the best performance to the ResNet-101 model pre-trained on the Landmarks dataset, reaching 84% accuracy on verification and 24% on retrieval. Additionally, a subset of Amsterdam landmarks is collected for feature evaluation in a classification task. The classification labels are further used to extract visual explanations with Grad-CAM, allowing inspection of the visual cues learned by the deep metric learning models.
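To make the retrieval baseline concrete, the sketch below illustrates one way such an evaluation could be run: a frozen ResNet-101 backbone provides global descriptors, and archival (gallery) images are ranked against a street-view query by cosine similarity. This is a minimal illustration, not the paper's exact pipeline; ImageNet weights stand in for the Landmarks pre-training, and the file names are hypothetical.

```python
# Minimal retrieval sketch: frozen ResNet-101 descriptors + cosine-similarity ranking.
# ImageNet weights are used here as a stand-in for Landmarks pre-training.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# ResNet-101 backbone with the classification head removed,
# leaving the 2048-d globally pooled descriptor.
backbone = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def describe(path: str) -> torch.Tensor:
    """Return an L2-normalized global descriptor for one image."""
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    f = backbone(x).squeeze(0)
    return f / f.norm()

# Hypothetical file names: rank archival images against a street-view query.
query = describe("query_streetview.jpg")
gallery = torch.stack([describe(p) for p in ["archive_001.jpg", "archive_002.jpg"]])
ranking = torch.argsort(gallery @ query, descending=True)
print(ranking)
```

In this setup, top-1 retrieval accuracy would be the fraction of queries whose highest-ranked archival image depicts the same place.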