Human evaluation is critical for validating the performance of text-to-image generative models, as this highly cognitive process requires deep comprehension of both text and images. However, our survey of 37 recent papers reveals that many works rely solely on automatic measures (e.g., FID) or perform poorly described human evaluations that are neither reliable nor repeatable. This paper proposes a standardized, well-defined human evaluation protocol to facilitate verifiable and reproducible human evaluation in future works. In our pilot data collection, we experimentally show that current automatic measures are inconsistent with human perception when evaluating text-to-image generation results. Furthermore, we provide insights for designing human evaluation experiments that are reliable and conclusive. Finally, we make several resources publicly available to the community to facilitate easy and fast implementation.
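As a concrete illustration of the kind of automatic measure the survey refers to, the sketch below computes FID with the torchmetrics library; the image tensors, batch sizes, and shapes are placeholder assumptions for the example and do not come from this paper's evaluation.

```python
# Minimal sketch of computing FID (the automatic measure cited above) with
# torchmetrics. All tensors here are random placeholders, not the paper's data.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# FID compares InceptionV3 feature statistics of real vs. generated images.
fid = FrechetInceptionDistance(feature=2048)

# Dummy uint8 image batches of shape (N, 3, H, W) in [0, 255]; real
# evaluations use thousands of images to obtain stable statistics.
real_images = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)

fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print(f"FID: {fid.compute():.2f}")  # lower is better
```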