This paper tackles the challenging problem of generating photorealistic images from semantic layouts in few-shot scenarios, where annotated training pairs are scarce because pixel-wise annotation is costly. We present a training strategy that performs pseudo labeling of semantic masks using a StyleGAN prior. Our key idea is to construct a simple mapping between StyleGAN features and each semantic class from a few examples of semantic masks. With such mappings, we can generate an unlimited number of pseudo semantic masks from random noise to train an encoder that controls a pre-trained StyleGAN generator. Although the pseudo semantic masks may be too coarse for previous approaches that require pixel-aligned masks, our framework can synthesize high-quality images not only from dense semantic masks but also from sparse inputs such as landmarks and scribbles. Qualitative and quantitative results on various datasets demonstrate improvement over previous approaches with respect to layout fidelity and visual quality in as few as one- or five-shot settings.
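The mapping from StyleGAN features to semantic classes described above can be illustrated with a minimal nearest-prototype sketch. This is only an assumed simplification, not the paper's exact procedure: it builds one prototype feature vector per class by averaging the generator's spatial features under the few annotated masks, then pseudo-labels any new feature map by nearest-prototype assignment. All function names and the use of raw Euclidean distance are illustrative assumptions.

```python
import numpy as np

def build_class_prototypes(features, masks, num_classes):
    """Average StyleGAN feature vectors per semantic class.

    features: list of (H, W, C) feature maps from the few annotated examples
    masks:    list of (H, W) integer semantic masks aligned to those maps
    Returns a (num_classes, C) array of class prototypes.
    """
    dim = features[0].shape[-1]
    protos = np.zeros((num_classes, dim))
    counts = np.zeros(num_classes)
    for feat, mask in zip(features, masks):
        for c in range(num_classes):
            sel = feat[mask == c]          # all feature vectors labeled c
            if len(sel):
                protos[c] += sel.sum(axis=0)
                counts[c] += len(sel)
    counts[counts == 0] = 1                # avoid division by zero for unseen classes
    return protos / counts[:, None]

def pseudo_label(feature_map, protos):
    """Assign each spatial feature vector to its nearest class prototype."""
    h, w, dim = feature_map.shape
    flat = feature_map.reshape(-1, dim)
    # squared Euclidean distance from every location to every prototype
    dists = ((flat[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1).reshape(h, w)
```

With prototypes in hand, pseudo masks for encoder training would be obtained by sampling random latents, extracting the generator's feature maps, and calling `pseudo_label` on each one.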