Predicting the geographic location (geo-localization) of a single ground-level RGB image taken anywhere in the world is a very challenging problem. The challenges include the huge diversity of images across environmental scenarios; drastic variation in the appearance of the same location with time of day, weather, and season; and, most importantly, the fact that the prediction must be made from a single image that may contain only a few geo-locating cues. For these reasons, most existing works are restricted to specific cities, imagery, or worldwide landmarks. In this work, we focus on developing an efficient solution to planet-scale single-image geo-localization. To this end, we propose TransLocator, a unified dual-branch transformer network that attends to tiny details over the entire image and produces robust feature representations under extreme appearance variations. TransLocator takes an RGB image and its semantic segmentation map as inputs, exchanges information between its two parallel branches after each transformer layer, and simultaneously performs geo-localization and scene recognition in a multi-task fashion. We evaluate TransLocator on four benchmark datasets (Im2GPS, Im2GPS3k, YFCC4k, and YFCC26k) and obtain continent-level accuracy improvements of 5.5%, 14.1%, 4.9%, and 9.9%, respectively, over the state of the art. TransLocator is also validated on real-world test images and found to be more effective than previous methods.
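The abstract reports continent-level accuracy without defining it. As a minimal sketch, and assuming the convention commonly used on Im2GPS-style benchmarks (a prediction counts as correct at a given scale if its great-circle distance to the true location is within a threshold, with 2500 km for the continent scale), the metric could be computed as follows. The threshold table, the Earth radius, and the function names are assumptions of this sketch and are not taken from the abstract.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumption of this sketch)

# Assumed distance thresholds in km; the abstract itself does not state them.
THRESHOLDS_KM = {
    "street": 1,
    "city": 25,
    "region": 200,
    "country": 750,
    "continent": 2500,
}


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def accuracy_at_scales(predictions, ground_truth):
    """Fraction of predictions that fall within each distance threshold."""
    if not predictions or len(predictions) != len(ground_truth):
        raise ValueError("need two non-empty lists of equal length")
    distances = [great_circle_km(*p, *g)
                 for p, g in zip(predictions, ground_truth)]
    return {name: sum(d <= km for d in distances) / len(distances)
            for name, km in THRESHOLDS_KM.items()}


if __name__ == "__main__":
    # Hypothetical predictions: one exact hit (Paris), one New York guess
    # for an image actually taken in London.
    preds = [(48.8566, 2.3522), (40.7128, -74.0060)]
    truth = [(48.8566, 2.3522), (51.5074, -0.1278)]
    print(accuracy_at_scales(preds, truth))
```

Under these assumptions, an improvement in continent-level accuracy means a larger share of test images is localized to within 2500 km of the true position.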