We present a unified and compact representation for object rendering, 3D reconstruction, and grasp pose prediction that can be inferred from a single image within a few seconds. We achieve this by leveraging recent advances in the Neural Radiance Field (NeRF) literature that learn category-level priors and fine-tune on novel objects with minimal data and time. Our insight is that we can learn a compact shape representation and extract meaningful additional information from it, such as grasping poses. We believe this to be the first work to retrieve grasping poses directly from a NeRF-based representation using a single viewpoint (RGB-only), rather than going through a secondary network and/or representation. Compared to prior art, our method is two to three orders of magnitude smaller while achieving comparable performance at view reconstruction and grasping. Accompanying our method, we also propose a new dataset of rendered shoes, annotated with grasping poses for grippers of different widths, for training a sim-to-real NeRF method.