DATID-3D: Diversity-Preserved Domain Adaptation Using Text-to-Image Diffusion for 3D Generative Model

Recent 3D generative models have achieved remarkable performance in synthesizing high-resolution, photorealistic images with view consistency and detailed 3D shapes, but training them for diverse domains is challenging because it requires massive sets of training images together with their camera distribution information. Text-guided domain adaptation methods have shown impressive performance in converting a 2D generative model trained on one domain into models for other domains with different styles by leveraging CLIP (Contrastive Language-Image Pre-training), rather than collecting massive datasets for those domains. However, one drawback of these methods is that the sample diversity of the original generative model is not well preserved in the domain-adapted generative models, owing to the deterministic nature of the CLIP text encoder. Text-guided domain adaptation is even more challenging for 3D generative models, not only because of catastrophic diversity loss but also because of inferior text-image correspondence and poor image quality. Here we propose DATID-3D, a domain adaptation method tailored for 3D generative models that uses text-to-image diffusion models, which can synthesize diverse images per text prompt, without collecting additional images and camera information for the target domain. Unlike 3D extensions of prior text-guided domain adaptation methods, our novel pipeline fine-tunes the state-of-the-art 3D generator of the source domain to synthesize high-resolution, multi-view-consistent images in text-guided target domains without additional data, outperforming existing text-guided domain adaptation methods in diversity and text-image correspondence. Furthermore, we propose and demonstrate diverse 3D image manipulations, such as one-shot instance-selected adaptation and single-view manipulated 3D reconstruction, to fully exploit the diversity in text.
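The diversity argument in the abstract can be illustrated with a toy numerical sketch. This is not the DATID-3D pipeline, and it uses neither CLIP nor a diffusion model: the random feature vectors, the `adapt` interpolation rule, and the `strength` value are all assumptions made purely for illustration. It compares pulling every sample toward one fixed target (standing in for a deterministic text embedding) with pulling each sample toward its own varied target (standing in for a generator that yields a different image per prompt).

```python
import numpy as np


def mean_pairwise_distance(x):
    """Average Euclidean distance between all distinct pairs of rows in x."""
    diffs = x[:, None, :] - x[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    n = x.shape[0]
    # The diagonal is zero, so dividing by n * (n - 1) averages distinct pairs.
    return dists.sum() / (n * (n - 1))


def adapt(samples, targets, strength):
    """Hypothetical adaptation rule: move each sample toward its target."""
    return (1.0 - strength) * samples + strength * targets


rng = np.random.default_rng(0)
n, dim, strength = 200, 16, 0.8

# Toy "source domain" samples (assumed Gaussian features, not real images).
source = rng.normal(size=(n, dim))

# One fixed target shared by all samples (deterministic-embedding stand-in).
text_target = rng.normal(size=(1, dim)) + 5.0
shared = adapt(source, np.broadcast_to(text_target, source.shape), strength)

# A different target per sample (stand-in for diverse generated targets).
per_sample_targets = text_target + rng.normal(size=(n, dim))
varied = adapt(source, per_sample_targets, strength)

diversity_source = mean_pairwise_distance(source)
diversity_shared = mean_pairwise_distance(shared)
diversity_varied = mean_pairwise_distance(varied)

print(f"source: {diversity_source:.3f}")
print(f"shared target: {diversity_shared:.3f}")
print(f"per-sample targets: {diversity_varied:.3f}")
```

With a shared target, every pairwise distance shrinks by exactly the factor `1 - strength`, so diversity collapses as adaptation gets stronger. With per-sample targets, the variation in the targets offsets part of that shrinkage.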
