The computer graphics, 3D computer vision, and robotics communities have produced multiple approaches to representing and generating 3D shapes, along with a vast number of use cases. However, single-view reconstruction remains a challenging topic that can unlock various interesting use cases such as interactive design. In this work, we propose a novel framework that leverages the intermediate latent spaces of a Vision Transformer (ViT) and a joint image-text representation model, CLIP, for fast and efficient Single View Reconstruction (SVR). More specifically, we propose a novel mapping network architecture that learns a mapping from deep features extracted from ViT and CLIP to the latent space of a base 3D generative model. Unlike previous work, our method enables view-agnostic reconstruction of 3D shapes, even in the presence of large occlusions. We use the ShapeNetV2 dataset and perform extensive experiments with comparisons to SOTA methods to demonstrate our method's effectiveness.
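To make the core idea concrete, the following is a minimal PyTorch sketch of a mapping network that takes ViT and CLIP image features and predicts a latent code for a pretrained 3D generative model. The specific architecture, feature dimensions, and class name (`FeatureToLatentMapper`) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class FeatureToLatentMapper(nn.Module):
    """Illustrative mapper from concatenated ViT and CLIP image features
    to the latent code of a (frozen) base 3D generative model.
    All dimensions below are assumptions for the sketch."""
    def __init__(self, vit_dim=768, clip_dim=512, latent_dim=256, hidden_dim=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vit_dim + clip_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, latent_dim),
        )

    def forward(self, vit_feat, clip_feat):
        # vit_feat: (B, vit_dim), clip_feat: (B, clip_dim)
        return self.mlp(torch.cat([vit_feat, clip_feat], dim=-1))

# Usage sketch: random tensors stand in for ViT [CLS] and CLIP image
# embeddings; the predicted latent would be decoded by the 3D generator.
mapper = FeatureToLatentMapper()
vit_feat = torch.randn(4, 768)
clip_feat = torch.randn(4, 512)
latent = mapper(vit_feat, clip_feat)
print(latent.shape)  # torch.Size([4, 256])
```

In such a setup the base 3D generative model would typically stay frozen, and only the mapper is trained so that the predicted latent codes decode into shapes matching the input view.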