We address the new problem of language-guided semantic style transfer of 3D indoor scenes. The input is a 3D indoor scene mesh and several phrases that describe the target scene. First, 3D vertex coordinates are mapped to RGB residues by a multi-layer perceptron. Second, the colored 3D meshes are differentiably rendered into 2D images, using a viewpoint sampling strategy tailored for indoor scenes. Third, the rendered 2D images are compared to the phrases using pre-trained vision-language models. Finally, errors are back-propagated to the multi-layer perceptron to update the vertex colors corresponding to certain semantic categories. We conducted large-scale qualitative analyses and A/B user tests on the public ScanNet and SceneNN datasets. We demonstrate that: (1) the results are visually pleasing and potentially useful for multimedia applications; (2) rendering 3D indoor scenes from viewpoints consistent with human priors is important; (3) incorporating semantics significantly improves style transfer quality; and (4) an HSV regularization term leads to results that are more consistent with the inputs and are generally rated better. Code and the user study toolbox are available at https://github.com/AIR-DISCOVER/LASST
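The first and last ingredients of the pipeline (a perceptron that maps vertex coordinates to RGB residues, color updates restricted to chosen semantic categories, and an HSV-based consistency term) can be illustrated with a minimal sketch. This is not the released implementation: the network size, the `scale` factor, the label strings, and the exact form of `hsv_penalty` are assumptions made for illustration, and the differentiable rendering, vision-language comparison, and back-propagation steps are omitted entirely.

```python
import colorsys
import math
import random


def make_mlp(hidden=8, seed=0):
    """Randomly initialised 3 -> hidden -> 3 perceptron (hypothetical sizes)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(hidden)]
    w2 = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(3)]
    return w1, w2


def rgb_residue(mlp, xyz):
    """Map one vertex coordinate to an RGB residue in (-1, 1)."""
    w1, w2 = mlp
    hidden = [max(0.0, sum(w * x for w, x in zip(row, xyz))) for row in w1]  # ReLU
    return [math.tanh(sum(w * a for w, a in zip(row, hidden))) for row in w2]


def stylize(mlp, vertices, base_colors, labels, target_labels, scale=0.5):
    """Add residues only to vertices whose semantic label is targeted."""
    styled = []
    for xyz, rgb, label in zip(vertices, base_colors, labels):
        if label in target_labels:
            residue = rgb_residue(mlp, xyz)
            # Clamp to the valid color range after adding the scaled residue.
            rgb = [min(1.0, max(0.0, c + scale * r)) for c, r in zip(rgb, residue)]
        styled.append(list(rgb))
    return styled


def hsv_penalty(base_colors, new_colors):
    """Mean squared HSV distance between input and stylized colors.

    Assumed form of an HSV consistency term; hue is treated as circular.
    """
    total = 0.0
    for old, new in zip(base_colors, new_colors):
        h0, s0, v0 = colorsys.rgb_to_hsv(*old)
        h1, s1, v1 = colorsys.rgb_to_hsv(*new)
        dh = abs(h0 - h1)
        dh = min(dh, 1.0 - dh)
        total += dh ** 2 + (s0 - s1) ** 2 + (v0 - v1) ** 2
    return total / len(base_colors)


# Toy scene: three vertices, two labelled "wall" and one labelled "chair".
vertices = [(0.1, 0.2, 0.3), (1.0, 0.5, -0.2), (0.4, 0.4, 0.4)]
base_colors = [[0.5, 0.5, 0.5], [0.2, 0.3, 0.4], [0.9, 0.1, 0.1]]
labels = ["wall", "chair", "wall"]

mlp = make_mlp()
styled = stylize(mlp, vertices, base_colors, labels, target_labels={"wall"})
penalty = hsv_penalty(base_colors, styled)
```

In a full system the perceptron weights would be optimized by gradient descent on a rendering-based language loss, with a term like `penalty` added to keep the stylized colors close to the input; here the weights stay at their random initial values, so only the data flow is shown.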