Estimating the 6D poses of objects is an essential computer vision task. However, most conventional approaches rely on camera data from a single perspective and therefore suffer from occlusions. We overcome this issue with our novel multi-view 6D pose estimation method called MV6D, which accurately predicts the 6D poses of all objects in a cluttered scene based on RGB-D images from multiple perspectives. Our approach builds on the PVN3D network, which uses a single RGB-D image to predict keypoints of the target objects. We extend this approach by using a combined point cloud from multiple views and fusing the images from each view with a DenseFusion layer. In contrast to current multi-view pose detection networks such as CosyPose, MV6D learns the fusion of multiple perspectives in an end-to-end manner and requires neither multiple prediction stages nor subsequent fine-tuning of the predictions. Furthermore, we present three novel photorealistic datasets of cluttered scenes with heavy occlusions. All of them contain RGB-D images from multiple perspectives together with ground truth annotations for instance semantic segmentation and 6D pose estimation. MV6D significantly outperforms the state of the art in multi-view 6D pose estimation, even when the camera poses are known only inaccurately. Furthermore, we show that our approach is robust towards dynamic camera setups and that its accuracy increases incrementally with an increasing number of perspectives.
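To make the multi-view input concrete, the following is a minimal sketch, not the authors' implementation, of how per-view point clouds derived from RGB-D images can be merged into one combined cloud, assuming camera-to-world extrinsics are available; the function name and array layout are hypothetical.

```python
import numpy as np

def fuse_point_clouds(clouds, cam_poses):
    """Merge per-view point clouds into a single cloud in a common world frame.

    clouds:    list of (N_i, 3) arrays, points in each camera's own frame
    cam_poses: list of (4, 4) camera-to-world transforms (known extrinsics)
    """
    fused = []
    for pts, T in zip(clouds, cam_poses):
        # Homogenize the points, then map them from the camera frame
        # into the shared world frame: p_world = T @ p_cam.
        homo = np.concatenate([pts, np.ones((pts.shape[0], 1))], axis=1)
        fused.append((homo @ T.T)[:, :3])
    # Concatenating the transformed views yields the combined point cloud.
    return np.concatenate(fused, axis=0)
```

In MV6D, such a combined point cloud, enriched with image features fused per view via a DenseFusion layer, forms the input from which the object keypoints are predicted.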