Learning to Evaluate Performance of Multi-modal Semantic Localization

Semantic localization (SeLo) refers to the task of obtaining the most relevant locations in large-scale remote sensing (RS) images using semantic information such as text. As an emerging task based on cross-modal retrieval, SeLo achieves semantic-level retrieval with only caption-level annotation, which demonstrates its great potential for unifying downstream tasks. Although SeLo has been attempted in successive works, no work has yet systematically explored and analyzed this urgent direction. In this paper, we thoroughly study this field and provide a complete benchmark, in terms of both metrics and test data, to advance the SeLo task. First, based on the characteristics of this task, we propose multiple discriminative evaluation metrics to quantify SeLo performance. The devised significant area proportion, attention shift distance, and discrete attention distance are used to evaluate the generated SeLo map at the pixel level and the region level. Next, to provide standard evaluation data for the SeLo task, we contribute a diverse, multi-semantic, multi-objective Semantic Localization Testset (AIR-SLT). AIR-SLT consists of 22 large-scale RS images and 59 test cases with different semantics, and aims to provide a comprehensive evaluation of retrieval models. Finally, we analyze the SeLo performance of RS cross-modal retrieval models in detail, explore the impact of different variables on this task, and provide a complete benchmark for the SeLo task. We also establish a new paradigm for RS referring expression comprehension and demonstrate the great advantage of SeLo in semantics by combining it with tasks such as detection and road extraction. The proposed evaluation metrics, semantic localization testsets, and corresponding scripts are openly available at github.com/xiaoyuan1996/SemanticLocalizationMetrics.
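The abstract names three metrics but does not give their formulas. As a rough illustration of what pixel-level and region-level scores on a SeLo map might look like, the sketch below computes three stand-in measures from a 2-D SeLo map (values in [0, 1]) and a binary ground-truth mask. The function names, the 0.5 significance threshold, and all three formulas are hypothetical assumptions made for illustration only; they are not the paper's definitions, which are implemented in the released scripts.

```python
import numpy as np


def significant_area_proportion(selo_map, gt_mask, threshold=0.5):
    """Hypothetical stand-in: share of significant pixels (value >= threshold)
    that fall inside the ground-truth mask (pixel-level score, higher is better)."""
    significant = selo_map >= threshold
    total = significant.sum()
    if total == 0:
        return 0.0
    return float((significant & gt_mask).sum() / total)


def _weighted_centroid(weights):
    """Return the (row, col) centroid of a non-negative 2-D weight array."""
    mass = weights.sum()
    if mass == 0:
        raise ValueError("centroid is undefined for an all-zero array")
    rows, cols = np.indices(weights.shape)
    return np.array([(rows * weights).sum() / mass, (cols * weights).sum() / mass])


def attention_shift_distance(selo_map, gt_mask):
    """Hypothetical stand-in: distance between the attention-weighted centroid of
    the SeLo map and the ground-truth centroid, normalised by the image diagonal
    (region-level score, lower is better)."""
    diag = np.hypot(*selo_map.shape)
    shift = _weighted_centroid(selo_map) - _weighted_centroid(gt_mask.astype(float))
    return float(np.linalg.norm(shift) / diag)


def discrete_attention_distance(selo_map, threshold=0.5):
    """Hypothetical stand-in: mean distance of significant pixels from their own
    centroid, normalised by the image diagonal (higher means more scattered attention)."""
    significant = selo_map >= threshold
    if not significant.any():
        return 0.0
    points = np.argwhere(significant).astype(float)
    centre = points.mean(axis=0)
    diag = np.hypot(*selo_map.shape)
    return float(np.linalg.norm(points - centre, axis=1).mean() / diag)


if __name__ == "__main__":
    # Toy 10x10 example: the SeLo map highlights exactly the ground-truth block.
    selo = np.zeros((10, 10))
    selo[2:5, 2:5] = 1.0
    gt = np.zeros((10, 10), dtype=bool)
    gt[2:5, 2:5] = True
    print(significant_area_proportion(selo, gt))  # 1.0
    print(attention_shift_distance(selo, gt))     # 0.0
```

Combining a pixel-level score (how much of the highlighted area is correct) with region-level scores (where the attention is centred and how scattered it is) reflects the abstract's point that a SeLo map should be judged from more than one perspective; the specific formulas here are only placeholders for that idea.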