Evaluation metrics play a key role in measuring the performance of generative models for image synthesis. However, most metrics focus mainly on image fidelity. Existing diversity metrics are derived by comparing distributions, and thus cannot quantify the diversity or rarity of an individual generated image. In this work, we propose a new evaluation metric, called the 'rarity score', to measure the individual rarity of each image synthesized by a generative model. We first show empirically that, in terms of nearest-neighbor distances in feature space, common samples lie close to each other while rare samples lie far from each other. We then use our metric to demonstrate that the extent to which different generative models produce rare images can be effectively compared. We also propose a method to compare rarity between datasets that share the same concept, such as CelebA-HQ and FFHQ. Finally, we analyze the metric under different designs of feature spaces to better understand the relationship between feature spaces and the resulting sparse images. Code will be publicly available online for the research community.
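The abstract does not state the formula for the rarity score, so the following is only a minimal sketch of one plausible way to turn the nearest-neighbor observation into a per-image score. It assumes that features have already been extracted into NumPy arrays, that Euclidean distance is used, and that a generated sample is scored by the smallest k-nearest-neighbor radius among the real samples whose k-NN ball contains it. The function names `knn_radii` and `rarity_scores` and the parameter `k` are hypothetical and chosen for illustration.

```python
import numpy as np

def knn_radii(real_feats, k=3):
    """Distance from each real feature to its k-th nearest other real feature."""
    # Pairwise Euclidean distances between real features, shape (N, N).
    d = np.linalg.norm(real_feats[:, None, :] - real_feats[None, :, :], axis=-1)
    # After sorting each row, column 0 is the self-distance (0),
    # so column k holds the distance to the k-th nearest neighbor.
    return np.sort(d, axis=1)[:, k]

def rarity_scores(real_feats, gen_feats, k=3):
    """Illustrative per-sample score (an assumption, not the paper's stated formula).

    A generated sample that falls inside the k-NN ball of at least one real
    sample is scored by the smallest such ball radius: dense regions have
    small radii (common), sparse regions have large radii (rare).
    Samples outside every ball get NaN.
    """
    radii = knn_radii(real_feats, k)                       # shape (N,)
    # Distances from each generated feature to each real feature, shape (M, N).
    d = np.linalg.norm(gen_feats[:, None, :] - real_feats[None, :, :], axis=-1)
    inside = d <= radii[None, :]                           # ball membership
    candidates = np.where(inside, radii[None, :], np.inf)  # radii of containing balls
    scores = candidates.min(axis=1)
    return np.where(np.isinf(scores), np.nan, scores)
```

Under these assumptions, a generated sample landing in a tightly packed region of real features receives a small score, while one landing in a sparsely populated region receives a larger score. The brute-force pairwise distance matrix keeps the sketch short; for large feature sets a dedicated nearest-neighbor index would be the more practical choice.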