A key technical challenge in performing 6D object pose estimation from RGB-D images is to fully leverage the two complementary data sources. Prior works either extract information from the RGB image and depth separately or use costly post-processing steps, limiting their performance in highly cluttered scenes and real-time applications. In this work, we present DenseFusion, a generic framework for estimating the 6D pose of a set of known objects from RGB-D images. DenseFusion is a heterogeneous architecture that processes the two data sources individually and uses a novel dense fusion network to extract pixel-wise dense feature embeddings, from which the pose is estimated. Furthermore, we integrate an end-to-end iterative pose refinement procedure that further improves the pose estimate while achieving near real-time inference. Our experiments show that our method outperforms state-of-the-art approaches on two datasets, YCB-Video and LineMOD. We also deploy our method on a real robot to grasp and manipulate objects based on the estimated pose.
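The pixel-wise dense fusion idea can be sketched minimally as follows. This is an illustrative NumPy sketch under assumed embedding dimensions, not the paper's actual architecture (which uses learned CNN and point-cloud features with MLPs and max pooling): each pixel's appearance embedding is concatenated with its geometric embedding and a pooled global feature, yielding one dense fused feature per pixel.

```python
import numpy as np

def dense_fuse(color_emb, geo_emb):
    """Pixel-wise dense fusion sketch: concatenate per-pixel color and
    geometric embeddings with a pooled global feature.

    color_emb: (N, Dc) per-pixel appearance features (hypothetical dims)
    geo_emb:   (N, Dg) per-point geometric features (hypothetical dims)
    returns:   (N, Dc + Dg + Dc + Dg) fused per-pixel features
    """
    per_pixel = np.concatenate([color_emb, geo_emb], axis=1)   # (N, Dc+Dg)
    # Simple mean pooling stands in for the learned global pooling here.
    global_feat = per_pixel.mean(axis=0)                       # (Dc+Dg,)
    global_rep = np.tile(global_feat, (color_emb.shape[0], 1)) # repeat for every pixel
    return np.concatenate([per_pixel, global_rep], axis=1)

# Example: 500 sampled pixels, 32-d color, 64-d geometry embeddings
fused = dense_fuse(np.random.rand(500, 32), np.random.rand(500, 64))
print(fused.shape)  # (500, 192)
```

Because every per-pixel feature carries both local appearance/geometry and the global context, a pose predictor can operate on each pixel independently and the best per-pixel hypothesis can be selected, which is what makes the fusion robust to clutter and occlusion.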