We introduce a new landmark recognition dataset, created with a focus on fair worldwide representation. While previous work proposes collecting as many images as possible from web repositories, we argue that such approaches can lead to biased data. To create a more comprehensive and equitable dataset, we start by defining the fair relevance of a landmark to the world population. These relevances are estimated by combining anonymized Google Maps user contribution statistics with the contributors' demographic information. We present a stratification approach and analysis that lead to much fairer coverage of the world compared to existing datasets. The resulting datasets are used to evaluate computer vision models as part of the Google Landmark Recognition and Retrieval Challenges 2021.