Solving for the camera-to-robot pose is a fundamental requirement for vision-based robot control, and is a process that takes considerable effort and care to make accurate. Traditional approaches require modification of the robot via markers, while subsequent deep learning approaches enable markerless feature extraction. Because acquiring 3D annotations for real images is labor-intensive, mainstream deep learning methods use only synthetic data and rely on Domain Randomization to bridge the sim-to-real gap. In this work, we go beyond the limitation of 3D annotations for real-world data. We propose an end-to-end pose estimation framework that is capable of online camera-to-robot calibration, and a self-supervised training method that scales training to unlabeled real-world data. Our framework combines deep learning and geometric vision to solve for the robot pose, and the pipeline is fully differentiable. To train the Camera-to-Robot Pose Estimation Network (CtRNet), we leverage foreground segmentation and differentiable rendering for image-level self-supervision: the pose prediction is visualized through a renderer, and the image loss against the input image is back-propagated to train the neural network. Our experimental results on two public real datasets confirm the effectiveness of our approach over existing works. We also integrate our framework into a visual servoing system to demonstrate the promise of real-time, precise robot pose estimation for automation tasks.
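To make the image-level self-supervision concrete, below is a minimal PyTorch sketch of the idea, not the authors' implementation: a pose regressor predicts a pose from an unlabeled image, a toy differentiable "renderer" splats projected robot keypoints into a soft silhouette, and the image loss against a foreground segmentation mask is back-propagated to the network. The intrinsics, `soft_render`, the keypoint set, the translation-only pose head, and the mask source (here generated by the same renderer as a stand-in for a segmentation network) are all illustrative assumptions.

```python
# Sketch of render-and-compare self-supervision; all details are assumptions.
import torch
import torch.nn as nn

H, W = 120, 160                                  # assumed image size
FX, FY, CX, CY = 200.0, 200.0, 80.0, 60.0        # assumed pinhole intrinsics

def project(points_cam):
    """Pinhole projection of 3D camera-frame points to pixel coordinates."""
    x, y, z = points_cam.unbind(-1)
    return torch.stack((FX * x / z + CX, FY * y / z + CY), dim=-1)

def soft_render(points_cam, sigma=4.0):
    """Toy differentiable renderer: Gaussian splats of projected keypoints."""
    uv = project(points_cam)                                   # (N, 2)
    ys = torch.arange(H, dtype=torch.float32).view(H, 1, 1)
    xs = torch.arange(W, dtype=torch.float32).view(1, W, 1)
    d2 = (xs - uv[:, 0]) ** 2 + (ys - uv[:, 1]) ** 2           # (H, W, N)
    return 1.0 - torch.prod(1.0 - torch.exp(-d2 / sigma**2), dim=-1)

# Stand-in pose regressor (toy: predicts a translation offset only).
pose_net = nn.Sequential(
    nn.Flatten(), nn.Linear(H * W, 64), nn.ReLU(), nn.Linear(64, 3))
optimizer = torch.optim.Adam(pose_net.parameters(), lr=1e-3)

robot_points = 0.05 * torch.randn(8, 3)          # assumed robot link keypoints
base = torch.tensor([0.0, 0.0, 1.0])             # keep points in front of camera
image = torch.rand(1, H, W)                      # unlabeled real image

with torch.no_grad():                            # stand-in for a segmentation net
    seg_mask = soft_render(robot_points + base)

translation = pose_net(image).squeeze(0)                     # predicted pose
rendered = soft_render(robot_points + base + 0.1 * translation)
loss = torch.mean((rendered - seg_mask) ** 2)                # image-level loss

optimizer.zero_grad()
loss.backward()                                  # self-supervised gradient
optimizer.step()
```

Because every step from pose prediction through rendering to the loss is differentiable, the gradient of the silhouette mismatch flows back into the pose network, which is what lets training scale to real images without 3D annotations.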