Creating 3D content via stylization is a promising yet challenging problem in computer vision and graphics research. In this work, we focus on stylizing photorealistic appearance renderings of a given surface mesh of arbitrary topology. Motivated by the recent surge of cross-modal supervision from the Contrastive Language-Image Pre-training (CLIP) model, we propose TANGO, which transfers the appearance style of a given 3D shape according to a text prompt in a photorealistic manner. Technically, we propose to disentangle the appearance style into the spatially varying bidirectional reflectance distribution function, the local geometric variation, and the lighting condition, which are jointly optimized, under the supervision of a CLIP loss, by a spherical-Gaussian-based differentiable renderer. As such, TANGO enables photorealistic 3D style transfer by automatically predicting reflectance effects even for bare, low-quality meshes, without training on a task-specific dataset. Extensive experiments show that TANGO outperforms existing methods of text-driven 3D style transfer in terms of photorealistic quality, consistency of 3D geometry, and robustness when stylizing low-quality meshes. Our code and results are available at our project webpage https://cyw-3d.github.io/tango/.
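The joint optimization described above can be sketched as a simple loop: render the mesh with the current appearance parameters, score the rendering against the text prompt with a CLIP loss, and update all disentangled factors together. The sketch below is a heavily simplified stand-in, not TANGO's actual implementation: each factor is a single scalar, and the CLIP loss is replaced by a toy quadratic distance to a hypothetical target embedding direction, so the example stays self-contained and runnable.

```python
# Stand-ins (assumptions, not TANGO's code): each disentangled factor
# that TANGO optimizes -- SVBRDF coefficients, a local normal offset,
# and spherical-Gaussian lighting parameters -- is reduced to one float.
params = {"svbrdf": 0.2, "normal_offset": 0.8, "lighting": -0.5}

# Hypothetical target direction; in TANGO this role is played by the
# CLIP embedding of the style text prompt.
target = {"svbrdf": 1.0, "normal_offset": 0.0, "lighting": 0.3}

def style_loss(p):
    # Stand-in for 1 - cos_sim(CLIP(render(p)), CLIP(text_prompt));
    # a squared distance keeps the sketch dependency-free.
    return sum((p[k] - target[k]) ** 2 for k in p)

def gradient_step(p, lr=0.1):
    # Analytic gradient of the quadratic stand-in loss, applied jointly
    # to all three factors, mirroring TANGO's joint optimization.
    return {k: v - lr * 2.0 * (v - target[k]) for k, v in p.items()}

losses = []
for _ in range(100):
    losses.append(style_loss(params))
    params = gradient_step(params)
```

In the real method, `render` would be the spherical-Gaussian-based differentiable renderer and the gradient would flow through CLIP's image encoder via automatic differentiation; the loop structure, however, is the same.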