A central goal of visual recognition is to understand objects and scenes from a single image. 2D recognition has witnessed tremendous progress thanks to large-scale learning and general-purpose representations. Comparatively, 3D poses new challenges stemming from occlusions that are not depicted in the image. Prior works try to overcome these challenges by inferring from multiple views, or rely on scarce CAD model collections and category-specific priors, which hinder scaling to novel settings. In this work, we explore single-view 3D reconstruction by learning generalizable representations inspired by advances in self-supervised learning. We introduce a simple framework that operates on 3D points of single objects or whole scenes, coupled with category-agnostic large-scale training from diverse RGB-D videos. Our model, Multiview Compressive Coding (MCC), learns to compress the input appearance and geometry to predict the 3D structure by querying a 3D-aware decoder. MCC's generality and efficiency allow it to learn from large-scale and diverse data sources, with strong generalization to novel objects imagined by DALL$\cdot$E 2 or captured in-the-wild with an iPhone.
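To make the query-based decoding idea concrete, the sketch below illustrates one plausible reading in PyTorch: an encoder compresses the input into a latent code, and a decoder is probed with arbitrary 3D query points to predict occupancy and color. This is a minimal sketch under stated assumptions, not MCC's actual architecture; all module names, dimensions, and output heads here are illustrative.

```python
import torch
import torch.nn as nn

class PointQueryDecoder(nn.Module):
    """Hypothetical 3D-aware decoder: given a compressed scene latent,
    predicts occupancy and RGB at arbitrary 3D query points."""

    def __init__(self, latent_dim=512, hidden_dim=256):
        super().__init__()
        # Embed each query point's (x, y, z) coordinates.
        self.point_embed = nn.Linear(3, hidden_dim)
        # Fuse the scene latent with the per-point embedding.
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim + hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        self.occupancy_head = nn.Linear(hidden_dim, 1)  # is the point occupied?
        self.color_head = nn.Linear(hidden_dim, 3)      # RGB at the point

    def forward(self, latent, queries):
        # latent:  (B, latent_dim) code from an appearance/geometry encoder
        # queries: (B, N, 3) 3D points at which to probe the scene
        q = self.point_embed(queries)                       # (B, N, hidden)
        z = latent.unsqueeze(1).expand(-1, q.shape[1], -1)  # (B, N, latent)
        h = self.mlp(torch.cat([z, q], dim=-1))
        return self.occupancy_head(h), self.color_head(h)

# Usage: probe a random set of points in a normalized volume; reconstruction
# keeps the points predicted as occupied.
decoder = PointQueryDecoder()
latent = torch.randn(1, 512)              # stand-in for an encoder's output
queries = torch.rand(1, 4096, 3) * 2 - 1  # points in a [-1, 1]^3 volume
occ_logits, rgb = decoder(latent, queries)
```

Decoding by querying points, rather than emitting a fixed voxel grid or mesh, is what keeps such a design category-agnostic: the same decoder reconstructs single objects or whole scenes simply by changing where it is queried.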