Deep generative models are becoming increasingly powerful, now generating diverse, high-fidelity, photo-realistic samples from text prompts. Have they reached the point where models of natural images can be used for generative data augmentation, helping to improve challenging discriminative tasks? We show that large-scale text-to-image diffusion models can be fine-tuned to produce class-conditional models with SOTA FID (1.76 at 256x256 resolution) and Inception Score (239 at 256x256 resolution). The model also yields a new SOTA in Classification Accuracy Scores (64.96 for 256x256 generated samples, improving to 69.24 for 1024x1024 samples). Augmenting the ImageNet training set with samples from the resulting models yields significant improvements in ImageNet classification accuracy over strong ResNet and Vision Transformer baselines.