StyleGANs are at the forefront of controllable image generation, as they produce a semantically disentangled latent space that is well suited for image editing and manipulation. However, the performance of StyleGANs severely degrades when they are trained via class-conditioning on large-scale long-tailed datasets. We find that one reason for this degradation is the collapse of latents for each class in the $\mathcal{W}$ latent space. With NoisyTwins, we first introduce an effective and inexpensive augmentation strategy for class embeddings, and then decorrelate the latents through self-supervision in the $\mathcal{W}$ space. This decorrelation mitigates collapse, ensuring that our method preserves intra-class diversity along with class consistency in image generation. We show the effectiveness of our approach on the large-scale real-world long-tailed datasets ImageNet-LT and iNaturalist 2019, where our method outperforms other methods by $\sim 19\%$ in FID, establishing a new state-of-the-art.
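To make the two ingredients described above concrete, the following is a minimal PyTorch sketch of (i) noise augmentation of class embeddings and (ii) a Barlow Twins-style decorrelation loss applied to pairs of $\mathcal{W}$ latents obtained from the two augmented views. The function names, noise scale `sigma`, and loss weight `off_diag_weight` are illustrative assumptions, not the paper's exact formulation.

```python
import torch


def augment_class_embedding(class_emb: torch.Tensor, sigma: float = 0.25) -> torch.Tensor:
    """Add per-sample Gaussian noise to class embeddings (the inexpensive augmentation)."""
    return class_emb + sigma * torch.randn_like(class_emb)


def noisy_twins_loss(w_a: torch.Tensor, w_b: torch.Tensor,
                     off_diag_weight: float = 0.005, eps: float = 1e-6) -> torch.Tensor:
    """Barlow Twins-style decorrelation loss on two batches of W latents.

    w_a, w_b: (batch, dim) outputs of the mapping network for two independently
    noise-augmented copies of the same class embeddings.
    """
    n, d = w_a.shape
    # Standardize each latent dimension across the batch.
    w_a = (w_a - w_a.mean(dim=0)) / (w_a.std(dim=0) + eps)
    w_b = (w_b - w_b.mean(dim=0)) / (w_b.std(dim=0) + eps)

    # Empirical cross-correlation matrix between the twin views.
    c = (w_a.T @ w_b) / n  # shape: (dim, dim)

    # Invariance term pulls the diagonal toward 1 (twin latents of a class agree);
    # redundancy-reduction term pushes off-diagonal entries toward 0, which
    # discourages all latents of a class from collapsing to a single point.
    c_diff = (c - torch.eye(d, device=c.device)).pow(2)
    on_diag = torch.diagonal(c_diff).sum()
    off_diag = c_diff.sum() - on_diag
    return on_diag + off_diag_weight * off_diag


# Hypothetical usage inside a training step (mapping_net, z, c, gan_loss assumed):
# w_a = mapping_net(z, augment_class_embedding(c))
# w_b = mapping_net(z, augment_class_embedding(c))
# total_loss = gan_loss + lambda_nt * noisy_twins_loss(w_a, w_b)
```

In this sketch, the invariance term encourages class consistency (both noisy views of a class map to agreeing latents), while the redundancy-reduction term spreads the latents across dimensions, which is how the decorrelation counteracts per-class collapse in $\mathcal{W}$.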