In this work, we propose TediGAN, a novel framework for multi-modal image generation and manipulation with textual descriptions. The proposed method consists of three components: a StyleGAN inversion module, visual-linguistic similarity learning, and instance-level optimization. The inversion module maps real images to the latent space of a well-trained StyleGAN. The visual-linguistic similarity learning maps the image and the text into a common embedding space to learn text-image matching. The instance-level optimization preserves identity during manipulation. Our model can produce diverse and high-quality images at an unprecedented resolution of 1024 × 1024. Using a control mechanism based on style mixing, TediGAN inherently supports image synthesis with multi-modal inputs, such as sketches or semantic labels, with or without instance guidance. To facilitate text-guided multi-modal synthesis, we introduce Multi-Modal CelebA-HQ, a large-scale dataset consisting of real face images and corresponding semantic segmentation maps, sketches, and textual descriptions. Extensive experiments on the introduced dataset demonstrate the superior performance of our proposed method. Code and data are available at https://github.com/weihaox/TediGAN.
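To make the visual-linguistic similarity component concrete, the following is a minimal sketch of how an image encoder and a text encoder can project their inputs into a common embedding space and be trained with a matching objective. The encoder architectures, dimensions, and the specific contrastive loss here are illustrative assumptions for exposition, not the exact design used by TediGAN.

```python
# Minimal sketch of visual-linguistic similarity learning: an image encoder and a
# text encoder project into a shared embedding space and are trained so that
# matching image-text pairs lie close together. Architectures, dimensions, and the
# loss form below are illustrative assumptions, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageEncoder(nn.Module):
    def __init__(self, embed_dim=512):
        super().__init__()
        self.backbone = nn.Sequential(              # toy CNN stand-in for a real visual backbone
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.proj = nn.Linear(128, embed_dim)       # projection into the common space

    def forward(self, images):
        return F.normalize(self.proj(self.backbone(images)), dim=-1)

class TextEncoder(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 256)
        self.rnn = nn.GRU(256, 256, batch_first=True)
        self.proj = nn.Linear(256, embed_dim)       # projection into the same common space

    def forward(self, tokens):
        _, h = self.rnn(self.embed(tokens))
        return F.normalize(self.proj(h[-1]), dim=-1)

def matching_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss: each image should be most similar to its own caption."""
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Usage on a dummy batch of paired images and token sequences.
images = torch.randn(4, 3, 64, 64)
tokens = torch.randint(0, 10000, (4, 16))
loss = matching_loss(ImageEncoder()(images), TextEncoder()(tokens))
loss.backward()
```

Once such a common embedding space is learned, a textual description can be compared directly against candidate latent codes or generated images, which is what enables text-guided generation and manipulation on top of the inverted StyleGAN latent space.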