In this paper, we study state-of-the-art methods for single- and multi-object text-to-image synthesis and propose a common framework for evaluating them. We first identify several issues in the current evaluation of text-to-image models: (i) commonly used image-quality metrics such as the Inception Score (IS) are often either miscalibrated for the single-object case or misused for the multi-object case; (ii) the existing R-precision (RP) and SOA metrics, used to assess text relevance and object accuracy respectively, suffer from overfitting; (iii) many vital factors in the multi-object case are largely overlooked, e.g., object fidelity, positional alignment, and counting alignment; (iv) the ranking of methods under current metrics is highly inconsistent with real images. To overcome these limitations, we propose a combined set of existing and new metrics to evaluate the methods systematically. For existing metrics, we develop an improved version of IS, named IS*, which uses temperature scaling to calibrate the confidence of the classifier used by IS; we also propose a solution that mitigates the overfitting issues of RP and SOA. To cover the evaluation factors that are missing in the multi-object case, we develop new metrics: CA for counting alignment, PA for positional alignment, and object-centric IS (O-IS) and object-centric FID (O-FID) for object fidelity. As a result, our benchmark yields a highly consistent ranking among existing methods that is well aligned with human evaluation. We also create a strong baseline model (AttnGAN++) for the benchmark through a simple modification of the well-known AttnGAN. We will release this toolbox for unified evaluation, called TISE, to standardize the evaluation of text-to-image synthesis models.
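To make the idea behind a temperature-calibrated Inception Score concrete, the following is a minimal sketch and not the TISE implementation. It assumes the classifier logits for the generated images are already available as a NumPy array and that a temperature `T` has already been fitted on held-out data; the function names `softmax` and `inception_score` and the `eps` smoothing term are illustrative choices, not names from the paper.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Row-wise softmax of logits divided by a temperature (assumed pre-fitted)."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def inception_score(logits, temperature=1.0, eps=1e-12):
    """exp of the mean KL divergence between p(y|x) and the marginal p(y).

    logits: array of shape (num_images, num_classes).
    temperature=1.0 gives the uncalibrated score; a fitted temperature
    gives a calibrated variant in the spirit of IS*.
    """
    p_yx = softmax(logits, temperature)        # per-image class posterior
    p_y = p_yx.mean(axis=0, keepdims=True)     # marginal over all images
    kl = (p_yx * (np.log(p_yx + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))


if __name__ == "__main__":
    # Toy logits: four images, each confidently assigned to a different class.
    toy_logits = 10.0 * np.eye(4)
    print(inception_score(toy_logits, temperature=1.0))
    print(inception_score(toy_logits, temperature=5.0))
```

In this toy case, a temperature above 1 softens the overconfident posteriors and lowers the score, which is the kind of effect that calibration is meant to control.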