Image geolocalization is the challenging task of predicting the geographic coordinates of origin for a given photo. It is an unsolved problem relying on the ability to combine visual clues with general knowledge about the world to make accurate predictions across geographies. We present $\href{https://huggingface.co/geolocal/StreetCLIP}{\text{StreetCLIP}}$, a robust, publicly available foundation model not only achieving state-of-the-art performance on multiple open-domain image geolocalization benchmarks but also doing so in a zero-shot setting, outperforming supervised models trained on more than 4 million images. Our method introduces a meta-learning approach for generalized zero-shot learning by pretraining CLIP from synthetic captions, grounding CLIP in a domain of choice. We show that our method effectively transfers CLIP's generalized zero-shot capabilities to the domain of image geolocalization, improving in-domain generalized zero-shot performance without finetuning StreetCLIP on a fixed set of classes.
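The generalized zero-shot inference described above can be sketched in a few lines: candidate locations are turned into synthetic captions, and the image is assigned to the caption with the highest CLIP-style cosine similarity. The caption template and the dummy embeddings below are illustrative assumptions standing in for StreetCLIP's image and text encoders (the real checkpoint is available as `geolocal/StreetCLIP` on Hugging Face).

```python
import numpy as np

def make_captions(places):
    # Synthetic captions grounding CLIP in the geolocalization domain.
    # The exact template wording here is an illustrative assumption.
    return [f"A Street View photo in {p}." for p in places]

def zero_shot_geolocate(image_emb, caption_embs, places):
    # CLIP-style zero-shot classification: L2-normalize both sides,
    # rank candidate locations by cosine similarity, softmax the logits.
    img = image_emb / np.linalg.norm(image_emb)
    caps = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    logits = 100.0 * (caps @ img)  # ~100 is CLIP's learned logit scale
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return places[int(np.argmax(probs))], probs

places = ["France", "Japan", "Brazil"]
captions = make_captions(places)

# Dummy embeddings stand in for StreetCLIP's encoders; the image
# embedding is constructed to lie near the "Japan" caption.
rng = np.random.default_rng(0)
caption_embs = rng.normal(size=(3, 512))
image_emb = caption_embs[1] + 0.1 * rng.normal(size=512)

best, probs = zero_shot_geolocate(image_emb, caption_embs, places)
print(best)  # → Japan
```

Because the class set is just a list of captions, no finetuning on a fixed label set is needed: swapping `places` for a different granularity (countries, regions, cities) changes the classifier without retraining.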