Evaluating and comparing text-to-image models is a challenging problem. Significant advances have recently been made in the field, piquing the interest of various industrial sectors. As a consequence, a gold standard for evaluation should cover a variety of tasks and application contexts. In this paper, a novel evaluation approach is tested, based on: (i) a curated dataset composed of high-quality, royalty-free image-text pairs, divided into ten categories; (ii) a quantitative metric, the CLIP-score; and (iii) a human evaluation task to distinguish, for a given text, the real image from the generated ones. The proposed method has been applied to the most recent models, i.e., DALLE2, Latent Diffusion, Stable Diffusion, GLIDE, and Craiyon. Early experimental results show that the accuracy of the human judgement is fully consistent with the CLIP-score. The dataset has been made available to the public.
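For concreteness, a minimal sketch of how a CLIP-score between an image and its caption can be computed is given below, using the Hugging Face transformers CLIP implementation. The choice of library and of the checkpoint `openai/clip-vit-base-patch32` is an assumption for illustration; the paper does not specify which CLIP variant it relies on.

```python
# Minimal sketch: CLIP-score as the cosine similarity between the CLIP
# embeddings of an image and a caption. The model checkpoint is an
# assumption; the paper does not state which CLIP variant it uses.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between CLIP embeddings of an image and a caption."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Normalize the projected embeddings, then take their dot product.
    img_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    txt_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    return float((img_emb * txt_emb).sum())

# Usage: score a generated image against the prompt that produced it.
# score = clip_score(Image.open("generated.png"), "a red bicycle on a beach")
```

A higher score indicates closer alignment between the caption and the image, which is what allows the metric to be compared against human judgements of text-image correspondence.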