Domain adaptation of 3D portraits has attracted growing attention. However, the transfer mechanism of existing methods relies mainly on either vision or language alone, which ignores the potential of combined vision-language guidance. In this paper, we propose a vision-language coupled 3D portrait domain adaptation framework, named Image and Text portrait (ITportrait). ITportrait relies on a two-stage alternating training strategy. In the first stage, we employ a 3D Artistic Paired Transfer (APT) method for image-guided style transfer. APT constructs paired photo-realistic portraits to obtain accurate artistic poses, which helps ITportrait achieve high-quality 3D style transfer. In the second stage, we propose a 3D Image-Text Embedding (ITE) approach in the CLIP space. ITE uses a threshold function to adaptively control the optimization direction of the image or text guidance in the CLIP space. Comprehensive quantitative and qualitative results show that ITportrait achieves state-of-the-art (SOTA) results and benefits downstream tasks. All source code and pre-trained models will be released to the public.
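As a purely illustrative sketch of how a threshold function might switch between image and text guidance in an embedding space, consider the following. The abstract does not specify the form of the threshold, so everything here is an assumption: the function names `cosine` and `thresholded_guidance`, the parameter `tau`, the rule "follow the image until it is matched closely enough, then follow the text", and the use of plain Python lists in place of real CLIP embeddings. It should not be read as the ITE formulation itself.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def thresholded_guidance(gen, image_ref, text_ref, tau=0.8):
    """Hypothetical switching rule (an assumption, not the paper's definition).

    gen, image_ref, text_ref stand in for CLIP embeddings of the generated
    portrait, the style image, and the text prompt. While the generated
    embedding is less similar to the style image than tau, the image term
    is optimized; otherwise the text term is optimized.
    Returns the chosen guidance name and its loss (1 - cosine similarity).
    """
    image_sim = cosine(gen, image_ref)
    text_sim = cosine(gen, text_ref)
    if image_sim < tau:
        return "image", 1.0 - image_sim
    return "text", 1.0 - text_sim
```

In this toy rule, `tau` controls how closely the style image must be matched before the text prompt takes over; a real implementation would compute the embeddings with a CLIP encoder and backpropagate the selected loss into the 3D generator.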