Generating images from textual descriptions has received growing attention. Recently, DALL-E, a multimodal transformer language model, and its variants have shown high-quality text-to-image generation capabilities with a simple architecture and training objective, powered by large-scale training data and computation. However, despite the interesting image generation results, there has not been a detailed analysis of how to evaluate such models. In this work, we investigate the reasoning capabilities and social biases of these text-to-image generative transformers in detail. First, we measure four visual reasoning skills: object recognition, object counting, color recognition, and spatial relation understanding. For this, we propose PaintSkills, a diagnostic dataset and evaluation toolkit that measures these four visual reasoning skills. Second, we measure the text alignment and quality of the generated images based on pretrained image captioning, image-text retrieval, and image classification models. Third, we assess social biases in the models. For this, we propose evaluating the gender and racial biases of text-to-image generation models based on a pretrained image-text retrieval model and human evaluation. In our experiments, we show that recent text-to-image models perform better at recognizing and counting objects than at recognizing colors and understanding spatial relations, while a large gap remains between model performance and oracle accuracy on all skills. Next, we demonstrate that recent text-to-image models learn specific gender/racial biases from web image-text pairs. We also show that our automatic evaluations of visual reasoning skills and gender bias are highly correlated with human judgments. We hope our work will help guide future progress in improving text-to-image models on visual reasoning skills and social biases. Code and data at: https://github.com/j-min/DallEval
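The retrieval-based text alignment evaluation mentioned above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the embeddings here are hypothetical stand-ins, whereas a real evaluation would obtain them from a pretrained image-text retrieval model (e.g., a CLIP-style encoder) and check whether the ground-truth caption is retrieved for the generated image.

```python
# Minimal sketch of retrieval-style text-image alignment scoring.
# Assumption: `image_emb` and `caption_embs` are embeddings produced by a
# pretrained image-text retrieval model; here we use plain lists of floats.
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def r_precision_at_1(image_emb, caption_embs, true_index):
    """Score 1.0 if the ground-truth caption (at `true_index`) is ranked
    first among candidates by cosine similarity to the image, else 0.0."""
    sims = [cosine_similarity(image_emb, c) for c in caption_embs]
    best = max(range(len(sims)), key=sims.__getitem__)
    return 1.0 if best == true_index else 0.0
```

Averaging this score over many (generated image, caption-pool) pairs gives an automatic alignment metric that can be compared against human judgments, as the paper does.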