In recent years, Text-to-Image (T2I) models have been extensively studied, especially with the emergence of diffusion models that achieve state-of-the-art results on T2I synthesis tasks. However, existing benchmarks rely heavily on subjective human evaluation, limiting their ability to holistically assess a model's capabilities. Furthermore, there is a significant gap between the effort devoted to developing new T2I architectures and the effort devoted to their evaluation. To address this, we introduce HRS-Bench, a concrete evaluation benchmark for T2I models that is Holistic, Reliable, and Scalable. Unlike existing benchmarks that focus on limited aspects, HRS-Bench measures 13 skills grouped into five major categories: accuracy, robustness, generalization, fairness, and bias. In addition, HRS-Bench covers 50 scenarios, including fashion, animals, transportation, food, and clothes. We evaluate nine recent large-scale T2I models using metrics that cover a wide range of skills. To probe the effectiveness of HRS-Bench, we conducted a human evaluation, which agrees with our automatic evaluations on 95% of cases on average. Our experiments demonstrate that existing models often struggle to generate images with the desired count of objects, visual text, or grounded emotions. We hope that our benchmark helps ease future text-to-image generation research. The code and data are available at https://eslambakr.github.io/hrsbench.github.io