In this work, we present a novel framework built to simplify 3D asset generation for amateur users. To enable interactive generation, our method supports a variety of input modalities that can be easily provided by a human, including images, text, partially observed shapes, and combinations of these, and further allows the user to adjust the strength of each input. At the core of our approach is an encoder-decoder that compresses 3D shapes into a compact latent representation, upon which a diffusion model is learned. To handle a variety of multi-modal inputs, we employ task-specific encoders with dropout, followed by a cross-attention mechanism. Owing to its flexibility, our model naturally supports a variety of tasks, outperforming prior work on shape completion, image-based 3D reconstruction, and text-to-3D. Most interestingly, our model can combine all these tasks into one Swiss-army-knife tool, enabling the user to perform shape generation from incomplete shapes, images, and textual descriptions at the same time, providing relative weights for each input and facilitating interactivity. Although our approach is shape-only, we further show an efficient method to texture the generated shape using large-scale text-to-image models.
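To make the conditioning scheme concrete, the sketch below illustrates one plausible reading of the pipeline: task-specific encoders per modality, per-modality dropout during training, user-controlled weights at inference, and cross-attention from the latent tokens into the concatenated conditioning tokens. This is a minimal PyTorch sketch under our own assumptions, not the paper's implementation; the encoder dictionary, the `p_drop` rate, and the per-block layout are all hypothetical placeholders.

```python
import torch
import torch.nn as nn

# Hedged sketch of multi-modal conditioning for a latent diffusion model.
# The concrete encoders (image, text, partial shape) are assumed to be
# supplied by the caller; only their (B, T, dim) output shape is assumed.

class MultiModalConditioner(nn.Module):
    """Encodes each modality with a task-specific encoder, randomly drops
    whole modalities during training, and scales each token set by a
    user-provided weight before cross-attention."""

    def __init__(self, encoders: dict, p_drop: float = 0.1):
        super().__init__()
        self.encoders = nn.ModuleDict(encoders)  # e.g. {"image": ..., "text": ...}
        self.p_drop = p_drop

    def forward(self, inputs: dict, weights: dict, train: bool = True):
        tokens = []
        for name, x in inputs.items():
            c = self.encoders[name](x)            # (B, T, dim) conditioning tokens
            if train and torch.rand(()) < self.p_drop:
                c = torch.zeros_like(c)           # drop the whole modality
            tokens.append(weights.get(name, 1.0) * c)
        return torch.cat(tokens, dim=1)           # concatenated for cross-attention


class CrossAttentionBlock(nn.Module):
    """One denoiser block: latent shape tokens attend to conditioning tokens."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, z, cond):
        out, _ = self.attn(query=self.norm(z), key=cond, value=cond)
        return z + out                            # residual update of the latents
```

Training with modality dropout is what makes the inference-time weights meaningful: since the denoiser has seen every subset of inputs (including none), scaling or zeroing a modality's tokens at test time stays in-distribution, in the same spirit as classifier-free guidance.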