6D object pose estimation is widely applied in robotic tasks such as grasping and manipulation. Prior methods using RGB-only images are vulnerable to heavy occlusion and poor illumination, so it is important to complement them with depth information. However, existing methods using RGB-D data cannot adequately exploit the consistent and complementary information between the RGB and depth modalities. In this paper, we present a novel method that uses an attention mechanism to effectively model the correlation within and across both modalities and learn discriminative, compact multi-modal features. We then explore effective fusion strategies for the intra- and inter-correlation modules to ensure efficient information flow between RGB and depth. To the best of our knowledge, this is the first work to explore effective intra- and inter-modality fusion in 6D pose estimation. Experimental results show that our method achieves state-of-the-art performance on the LineMOD and YCB-Video datasets. We also demonstrate that the proposed method benefits a real-world robot grasping task by providing accurate object pose estimates.
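The inter-modality correlation described above can be sketched as cross-attention, where features of one modality attend to the other. This is a minimal NumPy illustration, not the paper's actual architecture: the function names, single-head formulation, and residual fusion are assumptions for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over attention scores
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(x, y, wq, wk, wv):
    """Hypothetical inter-modality fusion: queries come from modality x,
    keys/values from modality y, so x aggregates complementary context
    from y. Shapes: x, y are (n_points, dim); wq/wk/wv are (dim, dim)."""
    q, k, v = x @ wq, y @ wk, y @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n_points, n_points)
    return x + attn @ v  # residual connection keeps original features

# Toy usage with random per-point RGB and depth (geometric) features
rng = np.random.default_rng(0)
rgb = rng.standard_normal((100, 64))
depth = rng.standard_normal((100, 64))
wq, wk, wv = (0.1 * rng.standard_normal((64, 64)) for _ in range(3))
fused_rgb = cross_modal_attention(rgb, depth, wq, wk, wv)
print(fused_rgb.shape)  # (100, 64)
```

The intra-modality case is the degenerate call with `x` and `y` set to the same feature map (self-attention within one modality); stacking and ordering these modules corresponds to the fusion strategies the paper explores.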