Generative modeling and representation learning are two key tasks in computer vision. However, these models are typically trained independently, which ignores the potential for each task to help the other and incurs training and model-maintenance overhead. In this work, we propose MAsked Generative Encoder (MAGE), the first framework to unify SOTA image generation and self-supervised representation learning. Our key insight is that using variable masking ratios in masked image modeling pre-training allows generative training (very high masking ratio) and representation learning (lower masking ratio) under the same training framework. Inspired by previous generative models, MAGE uses semantic tokens learned by a vector-quantized GAN at the inputs and outputs, combining this with masking. We can further improve the representation by adding a contrastive loss to the encoder output. We extensively evaluate the generation and representation learning capabilities of MAGE. On ImageNet-1K, a single MAGE ViT-L model obtains 9.10 FID on class-unconditional image generation and 78.9% top-1 accuracy for linear probing, achieving state-of-the-art performance in both image generation and representation learning. Code is available at https://github.com/LTH14/mage.
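The variable-masking idea above can be illustrated with a minimal sketch: sample a masking ratio per example, then replace that fraction of token positions with a mask id. The sampling range and the mask id value here are illustrative assumptions, not the paper's exact hyperparameters.

```python
import random

MASK_ID = -1  # placeholder mask token id (assumption, not from the paper)

def mask_tokens(tokens, ratio):
    """Replace a `ratio` fraction of token positions with MASK_ID."""
    n = len(tokens)
    k = int(n * ratio)
    idx = random.sample(range(n), k)
    masked = list(tokens)
    for i in idx:
        masked[i] = MASK_ID
    return masked

random.seed(0)
tokens = list(range(16))           # stand-in for VQGAN semantic token ids
ratio = random.uniform(0.5, 1.0)   # assumed sampling range for illustration
masked = mask_tokens(tokens, ratio)
```

A ratio drawn near 1.0 leaves the model almost nothing but mask tokens (the generative regime), while a lower ratio leaves enough visible tokens for representation learning, so one training loop covers both.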