My goal in this paper is twofold: to study how well deep models understand the images generated by DALL-E 2 and Midjourney, and to quantitatively evaluate these generative models. Two sets of generated images are collected for object recognition and visual question answering (VQA) tasks. On object recognition, the best of 10 state-of-the-art object recognition models achieves about 60\% top-1 and 80\% top-5 accuracy. These numbers are much lower than the best accuracy on the ImageNet dataset (91\% and 99\%, respectively). On VQA, the OFA model scores 77.3\% when answering 241 binary questions across 50 images, compared to 94.7\% on the binary subset of the VQA-v2 dataset. Humans, in contrast, can easily recognize the generated images and answer questions about them. We conclude that a) deep models struggle to understand the generated content, and may do better after fine-tuning, and b) there is a large distribution shift between the generated images and real photographs, and this shift appears to be category-dependent. Data is available at: https://drive.google.com/file/d/1n2nCiaXtYJRRF2R73-LNE3zggeU_HeH0/view?usp=sharing.
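To make the object-recognition protocol concrete, the following is a minimal sketch (not the paper's released code) of how top-1 and top-5 accuracy could be computed on a folder of generated images with a single pretrained ImageNet classifier from torchvision; the folder name, the layout of the data, and the choice of ResNet-50 are illustrative assumptions.

\begin{verbatim}
# Minimal sketch: top-1 / top-5 accuracy of a pretrained ImageNet
# classifier on a directory of generated images.
# Assumption: images live under generated_images/<class_name>/*.png and the
# class folders are the ImageNet synsets in sorted order, so ImageFolder's
# label indices line up with the model's 1000 output indices.
import torch
from torchvision import models, transforms
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

dataset = ImageFolder("generated_images", transform=preprocess)
loader = DataLoader(dataset, batch_size=32, shuffle=False)

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2).eval()

top1 = top5 = total = 0
with torch.no_grad():
    for images, labels in loader:
        logits = model(images)
        _, pred5 = logits.topk(5, dim=1)  # indices of the 5 highest logits
        top1 += (pred5[:, 0] == labels).sum().item()
        top5 += (pred5 == labels.unsqueeze(1)).any(dim=1).sum().item()
        total += labels.size(0)

print(f"top-1: {top1 / total:.3f}, top-5: {top5 / total:.3f}")
\end{verbatim}

The same loop would be repeated for each of the 10 classifiers, reporting the best-performing model, as described above.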