Domain adaptation of 3D portraits has attracted increasing attention. However, existing methods transfer style using either vision or language alone, overlooking the potential of combined vision-language guidance. In this paper, we propose an image-text multi-modal framework, Image and Text portrait (ITportrait), for 3D portrait domain adaptation. ITportrait relies on a two-stage alternating training strategy. In the first stage, we employ a 3D Artistic Paired Transfer (APT) method for image-guided style transfer. APT constructs paired photo-realistic portraits to obtain accurate artistic poses, which helps ITportrait achieve high-quality 3D style transfer. In the second stage, we propose a 3D Image-Text Embedding (ITE) approach in the CLIP space. ITE uses a threshold function to adaptively control whether the optimization is driven by image or text guidance in the CLIP space. Comprehensive experiments show that ITportrait achieves state-of-the-art (SOTA) results and benefits downstream tasks. All source code and pre-trained models will be released to the public.
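The abstract does not specify the form of ITE's threshold function, but its role can be illustrated with a minimal sketch: given CLIP embeddings of the current render and of the image and text targets, a threshold `tau` decides which modality still needs optimization. The function `select_guidance`, the selection policy, and the parameter `tau` are all hypothetical illustrations, not the paper's actual method.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_guidance(render_emb, image_emb, text_emb, tau=0.5):
    """Hypothetical threshold policy: optimize toward the CLIP target
    whose similarity to the current render is still below tau."""
    sim_img = cosine(render_emb, image_emb)
    sim_txt = cosine(render_emb, text_emb)
    if sim_img < tau and sim_img <= sim_txt:
        return "image"   # image target is furthest behind
    if sim_txt < tau:
        return "text"    # text target still needs optimization
    return "both"        # both targets already matched well
```

In a training loop, such a rule would alternate the CLIP loss between the image and text directions rather than always blending them with a fixed weight.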