Large Vision Models trained on internet-scale data have demonstrated strong capabilities in segmenting and semantically understanding object parts, even in cluttered, crowded scenes. However, while these models can direct a robot toward the general region of an object, they lack the geometric understanding required to precisely control dexterous robotic hands for 3D grasping. To overcome this, our key insight is to leverage simulation with a force-closure grasp generation pipeline that understands the local geometry of the hand and objects in the scene. Because this pipeline is slow and requires ground-truth observations, we distill the resulting data into a diffusion model that operates in real time on camera point clouds. By combining the global semantic understanding of internet-scale models with the geometric precision of simulation-based, locally-aware force-closure grasp generation, \our achieves high-performance semantic grasping without any manually collected training data. Visualizations are available on our website at https://ifgrasping.github.io/
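As a concrete illustration of the kind of criterion a simulation-based grasp generation pipeline can use, the sketch below implements a standard Ferrari-Canny-style force-closure quality check: contact friction cones are discretized into unit forces, the resulting 6D contact wrenches are collected, and the grasp is in force closure when the origin lies strictly inside their convex hull. This is a minimal sketch under common assumptions, not the paper's implementation; the function names, friction coefficient, and cone discretization are illustrative.

```python
# Minimal, illustrative force-closure quality check (Ferrari-Canny style).
# Not the paper's pipeline; all names and parameters here are hypothetical.
import numpy as np
from scipy.spatial import ConvexHull


def friction_cone_edges(normal, mu, num_edges=8):
    """Discretize the Coulomb friction cone at a contact into unit force directions."""
    normal = normal / np.linalg.norm(normal)
    # Build an orthonormal tangent basis (t1, t2) spanning the contact plane.
    t1 = np.cross(normal, [1.0, 0.0, 0.0])
    if np.linalg.norm(t1) < 1e-6:
        t1 = np.cross(normal, [0.0, 1.0, 0.0])
    t1 /= np.linalg.norm(t1)
    t2 = np.cross(normal, t1)
    angles = np.linspace(0.0, 2.0 * np.pi, num_edges, endpoint=False)
    edges = [normal + mu * (np.cos(a) * t1 + np.sin(a) * t2) for a in angles]
    return [e / np.linalg.norm(e) for e in edges]


def force_closure_quality(contact_points, contact_normals, center, mu=0.5):
    """Return a quality score that is positive iff the grasp is in force closure.

    contact_points / contact_normals: lists of 3D contacts on the object surface,
    center: object reference point for computing torques.
    """
    wrenches = []
    for p, n in zip(contact_points, contact_normals):
        for f in friction_cone_edges(np.asarray(n, dtype=float), mu):
            torque = np.cross(np.asarray(p, dtype=float) - center, f)
            wrenches.append(np.concatenate([f, torque]))
    # Convex hull of the 6D contact wrenches; degenerate contact sets can make
    # qhull fail, which a real pipeline would treat as "not force closure".
    hull = ConvexHull(np.array(wrenches))
    # hull.equations rows are [a, b] with a @ x + b <= 0 for points inside the
    # hull, so the origin is strictly inside iff every offset b is negative,
    # and min(-b) is the distance from the origin to the hull boundary.
    return float(np.min(-hull.equations[:, -1]))
```

In a pipeline like the one described above, a score of this kind could be used to filter or rank candidate dexterous grasps generated in simulation before the surviving grasps are distilled into the real-time diffusion model.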