In this paper, we study state-of-the-art methods for text-to-image synthesis and propose a framework to evaluate them. We consider synthesis in which an image contains either a single object or multiple objects. Our study identifies several issues in the current evaluation pipeline: (i) for image quality assessment, a commonly used metric such as Inception Score (IS) is often either miscalibrated for the single-object case or misused for the multi-object case; (ii) for text relevance and object accuracy assessment, the existing R-precision (RP) and Semantic Object Accuracy (SOA) metrics, respectively, suffer from overfitting; (iii) for the multi-object case, many vital evaluation factors, e.g., object fidelity, positional alignment, and counting alignment, are largely overlooked; (iv) the ranking of methods based on current metrics is highly inconsistent with real images. To overcome these issues, we propose a combined set of existing and new metrics to evaluate the methods systematically. For existing metrics, we offer an improved version of IS, named IS*, which uses temperature scaling to calibrate the confidence of the classifier used by IS; we also propose a solution to mitigate the overfitting issues of RP and SOA. For new metrics, we develop counting alignment, positional alignment, object-centric IS, and object-centric FID metrics for evaluating the multi-object case. We show that benchmarking with our bag of metrics yields a highly consistent ranking among existing methods that is well aligned with human evaluation. As a by-product, we create AttnGAN++, a simple but strong baseline for the benchmark, by stabilizing the training of AttnGAN with spectral normalization. We also release our toolbox, called TISE, to promote fair and consistent evaluation of text-to-image models.
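To make the idea behind IS* concrete, the following is a minimal sketch of an Inception Score computed from classifier logits, with an optional temperature applied before the softmax. It is an illustration under stated assumptions, not the TISE implementation: the function names `softmax` and `inception_score`, the use of raw logits as input, and the `temperature` parameter are all hypothetical, and the abstract does not say how the temperature is chosen (in standard temperature scaling it is fit on held-out labelled data).

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Row-wise softmax of logits divided by a temperature (hypothetical helper)."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def inception_score(logits, temperature=1.0, eps=1e-12):
    """exp(mean_x KL(p(y|x) || p(y))) computed from classifier logits.

    temperature=1.0 gives the usual IS on these logits; any other value
    rescales the classifier's confidence first. The calibrated value is
    an assumption here and must be supplied by the caller.
    """
    p_yx = softmax(logits, temperature)      # conditional label distribution per image
    p_y = p_yx.mean(axis=0, keepdims=True)   # marginal label distribution over images
    kl = (p_yx * (np.log(p_yx + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```

With an array `logits` of shape `(num_images, num_classes)`, `inception_score(logits)` gives the uncalibrated score and `inception_score(logits, temperature=T)` the score after rescaling by a chosen `T`. A temperature above 1 softens overconfident predictions, which lowers the score for a classifier that is confident on every image.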