Despite thousands of researchers, engineers, and artists actively working on improving text-to-image generation models, systems often fail to produce images that accurately align with the text inputs. We introduce TIFA (Text-to-Image Faithfulness evaluation with question Answering), an automatic evaluation metric that measures the faithfulness of a generated image to its text input via visual question answering (VQA). Specifically, given a text input, we automatically generate several question-answer pairs using a language model. We calculate image faithfulness by checking whether existing VQA models can answer these questions using the generated image. TIFA is a reference-free metric that allows for fine-grained and interpretable evaluations of generated images. TIFA also has better correlations with human judgments than existing metrics. Based on this approach, we introduce TIFA v1.0, a benchmark consisting of 4K diverse text inputs and 25K questions across 12 categories (object, counting, etc.). We present a comprehensive evaluation of existing text-to-image models using TIFA v1.0 and highlight the limitations and challenges of current models. For instance, we find that current text-to-image models, despite doing well on color and material, still struggle in counting, spatial relations, and composing multiple objects. We hope our benchmark will help carefully measure the research progress in text-to-image synthesis and provide valuable insights for further research.
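The scoring procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the helper names (`tifa_score`, `vqa_answer`) and the exact-match comparison are assumptions for the sketch; a real system would pass the generated image to the VQA model and may use softer answer matching.

```python
def tifa_score(qa_pairs, vqa_answer):
    """Fraction of text-derived questions the VQA model answers correctly.

    qa_pairs: list of (question, expected_answer) tuples generated by a
        language model from the text input (hypothetical format).
    vqa_answer: callable standing in for a VQA model; a real one would
        also take the generated image as input.
    """
    if not qa_pairs:
        return 0.0
    correct = sum(
        1
        for question, expected in qa_pairs
        # Exact string match after normalization; real systems may
        # score multiple-choice or free-form answers more leniently.
        if vqa_answer(question).strip().lower() == expected.strip().lower()
    )
    return correct / len(qa_pairs)


# Toy usage with a stubbed VQA model (no real image involved):
qa = [
    ("What animal is in the image?", "dog"),
    ("How many dogs are there?", "2"),
]
stub_vqa = lambda q: "dog" if "animal" in q else "3"
print(tifa_score(qa, stub_vqa))  # 0.5: one of two questions answered correctly
```

A higher score indicates the generated image is more faithful to the text input; per-question results also make the metric interpretable, since failures can be traced to specific categories such as counting or spatial relations.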