Despite thousands of researchers, engineers, and artists actively working on improving text-to-image generation models, systems often fail to produce images that accurately align with the text inputs. We introduce TIFA (Text-to-Image Faithfulness evaluation with question Answering), an automatic evaluation metric that measures the faithfulness of a generated image to its text input via visual question answering (VQA). Specifically, given a text input, we automatically generate several question-answer pairs using a language model. We calculate image faithfulness by checking whether existing VQA models can answer these questions using the generated image. TIFA is a reference-free metric that allows for fine-grained and interpretable evaluations of generated images. TIFA also correlates better with human judgments than existing metrics. Based on this approach, we introduce TIFA v1.0, a benchmark consisting of 4K diverse text inputs and 25K questions across 12 categories (object, counting, etc.). We present a comprehensive evaluation of existing text-to-image models using TIFA v1.0 and highlight the limitations and challenges of current models. For instance, we find that current text-to-image models, despite performing well on color and material, still struggle with counting, spatial relations, and composing multiple objects. We hope our benchmark will help carefully measure the research progress in text-to-image synthesis and provide valuable insights for further research.
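To make the scoring procedure concrete, here is a minimal sketch of the TIFA computation described above: the faithfulness score is the fraction of language-model-generated questions that a VQA model answers correctly on the generated image. The `QAPair` structure and the `vqa_answer` callable are illustrative assumptions, not the authors' actual interface; in practice the question-answer pairs come from prompting a language model with the text input, and the answering is done by an off-the-shelf VQA model.

```python
# Sketch of TIFA scoring: fraction of questions answered correctly by a VQA model.
# `vqa_answer(image, question, choices)` is a hypothetical wrapper around any
# off-the-shelf VQA model that returns one of the candidate answers.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QAPair:
    question: str        # e.g., "What color is the car?"
    choices: List[str]   # candidate answers proposed by the language model
    answer: str          # the correct answer implied by the text input


def tifa_score(
    image,
    qa_pairs: List[QAPair],
    vqa_answer: Callable[[object, str, List[str]], str],
) -> float:
    """Return the fraction of questions the VQA model answers correctly."""
    if not qa_pairs:
        return 0.0
    correct = sum(
        vqa_answer(image, qa.question, qa.choices) == qa.answer
        for qa in qa_pairs
    )
    return correct / len(qa_pairs)
```

Averaging this per-image score over a set of text inputs (e.g., the 4K prompts in TIFA v1.0) yields a benchmark-level faithfulness score, and grouping questions by category (object, counting, spatial relations, etc.) yields the fine-grained breakdowns discussed above.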