Semantic image synthesis aims to generate photo-realistic images from semantic segmentation maps. Despite much recent progress, training such models still requires large datasets of images annotated with per-pixel label maps, which are extremely tedious to obtain. To alleviate this high annotation cost, we propose a transfer method that leverages a model trained on a large source dataset to improve learning on small target datasets via estimated pairwise relations between source and target classes. A class affinity matrix is introduced as a first layer of the source model to make it compatible with the target label maps, and the source model is then finetuned for the target domain. To estimate the class affinities, we consider several ways to leverage prior knowledge: semantic segmentation on the source domain, textual label embeddings, and self-supervised vision features. We apply our approach to GAN-based and diffusion-based architectures for semantic synthesis. Our experiments show that these different affinity estimates can be effectively combined, and that our approach significantly improves over existing state-of-the-art transfer methods for generative image models.
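The class-affinity mechanism can be sketched as follows: a target label map is re-expressed in the source label space by a linear (1×1) first layer whose weights form the affinity matrix. This is a minimal illustrative sketch, not the paper's exact formulation; the shapes, random affinities, and column-stochastic normalization are assumptions for demonstration.

```python
import numpy as np

# Hypothetical sizes: 5 source classes, 3 target classes, a 4x4 label map.
n_src, n_tgt, H, W = 5, 3, 4, 4

# Class affinity matrix: rows index source classes, columns target classes.
# Here each column is normalized to a distribution over source classes
# (an assumed convention for this sketch).
rng = np.random.default_rng(0)
affinity = rng.random((n_src, n_tgt))
affinity /= affinity.sum(axis=0, keepdims=True)

# One-hot target label map, shape (n_tgt, H, W).
labels = rng.integers(0, n_tgt, size=(H, W))
onehot = np.eye(n_tgt)[labels].transpose(2, 0, 1)

# The "first layer": a 1x1 linear map that turns each target pixel into a
# soft mixture of source classes, yielding shape (n_src, H, W). The source
# model can then consume this map as if it were a source-domain input.
remapped = np.einsum("st,thw->shw", affinity, onehot)

print(remapped.shape)                          # (5, 4, 4)
print(np.allclose(remapped.sum(axis=0), 1.0))  # mass per pixel is preserved
```

In a network this layer would typically be implemented as a 1×1 convolution initialized with the estimated affinities, so it can be finetuned jointly with the rest of the source model.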