Creating 3D content generally requires highly specialized skills to design and model objects and other assets by hand. We address this problem through high-quality 3D asset retrieval from multi-modal inputs, including 2D sketches, images, and text. We use CLIP because it provides a bridge to higher-level latent features, and we fuse these features across modalities to address the lack of artistic control that affects common data-driven approaches. Our approach enables multi-modal, feature-conditioned retrieval over a 3D asset database by combining the latent embeddings of the inputs. We explore the effects of different combinations of feature embeddings across input types and weighting methods.
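The retrieval scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each modality (e.g. a sketch and a text prompt) has already been encoded into a CLIP-style unit-length embedding, and uses random vectors in place of real CLIP outputs. The `fuse_embeddings` and `retrieve` helpers are hypothetical names introduced here; fusion is a weighted sum of modality embeddings, and retrieval ranks database assets by cosine similarity to the fused query.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project vectors onto the unit sphere, as CLIP embeddings typically are.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def fuse_embeddings(embeddings, weights):
    # Hypothetical fusion: weighted sum of per-modality embeddings,
    # renormalized to unit length. The weights control how much each
    # input modality influences the query.
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    fused = np.sum(w[:, None] * np.stack(embeddings), axis=0)
    return l2_normalize(fused)

def retrieve(query, asset_embeddings, k=5):
    # On unit vectors, cosine similarity reduces to a dot product.
    sims = asset_embeddings @ query
    return np.argsort(-sims)[:k]

# Stand-in data: random unit vectors in place of real CLIP embeddings.
rng = np.random.default_rng(0)
dim = 512
asset_db = l2_normalize(rng.normal(size=(100, dim)))   # 3D asset database
sketch_emb = l2_normalize(rng.normal(size=dim))        # 2D sketch input
text_emb = l2_normalize(rng.normal(size=dim))          # text input

# Weight the sketch more heavily than the text, then rank assets.
query = fuse_embeddings([sketch_emb, text_emb], weights=[0.6, 0.4])
top_k = retrieve(query, asset_db, k=5)
```

Varying the `weights` argument corresponds to the weighting methods the abstract mentions: shifting weight between modalities biases the ranking toward the geometry implied by the sketch or the semantics of the text.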