In many automation tasks involving manipulation of rigid objects, the poses of the objects must be acquired. Vision-based pose estimation using a single RGB or RGB-D sensor is especially popular due to its broad applicability. However, single-view pose estimation is inherently limited by depth ambiguity and by ambiguities arising from phenomena such as occlusion, self-occlusion, and reflection. Aggregating information from multiple views can potentially resolve these ambiguities, but current state-of-the-art multi-view pose estimation methods use the multiple views only to aggregate single-view pose estimates, and thus rely on obtaining good single-view estimates. We present a multi-view pose estimation method which aggregates learned 2D-3D distributions from multiple views, both for the initial estimate and for optional refinement. Our method performs probabilistic sampling of 3D-3D correspondences under epipolar constraints, using learned 2D-3D correspondence distributions which are implicitly trained to respect visual ambiguities such as symmetry. Evaluation on the T-LESS dataset shows that our method reduces pose estimation errors by 80-91% compared to the best single-view method, and we present state-of-the-art results on T-LESS with four views, even compared with methods using five and eight views.
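To make the core step concrete, below is a minimal numpy sketch of sampling 3D-3D correspondences under an epipolar constraint and fitting a pose to them. It is an illustration of the idea as stated above, not the authors' implementation: the per-pixel correspondence distributions (`dist1`, `dist2`), the sampled model surface points (`surface_pts`), and all function names are hypothetical stand-ins, and a real system would add object detection, many samples, and a robust (e.g. RANSAC-style) fit.

```python
import numpy as np

def fundamental_matrix(K1, K2, R12, t12):
    """F with x2^T F x1 = 0, where (R12, t12) maps the camera-1 frame to camera-2."""
    tx = np.array([[0.0, -t12[2], t12[1]],
                   [t12[2], 0.0, -t12[0]],
                   [-t12[1], t12[0], 0.0]])
    return np.linalg.inv(K2).T @ tx @ R12 @ np.linalg.inv(K1)

def triangulate(P1, P2, u1, u2):
    """Linear (DLT) triangulation; P1, P2 are the 3x4 projection matrices."""
    A = np.stack([u1[0] * P1[2] - P1[0],
                  u1[1] * P1[2] - P1[1],
                  u2[0] * P2[2] - P2[0],
                  u2[1] * P2[2] - P2[1]])
    X = np.linalg.svd(A)[2][-1]          # null vector of A
    return X[:3] / X[3]

def sample_3d3d(pix1, dist1, pix2, dist2, surface_pts, F, P1, P2, rng, eps=2.0):
    """Draw one 3D-3D correspondence (model-frame point, triangulated world point).

    pix_i  : (N_i, 2) pixel coordinates inside the detection in view i
    dist_i : (N_i, M) rows = learned P(model point | pixel), assumed given
    """
    i = rng.integers(len(pix1))
    u1 = pix1[i]
    j = rng.choice(len(surface_pts), p=dist1[i])    # sample a model surface point
    # Epipolar line of u1 in view 2; keep view-2 pixels within eps pixels of it.
    l = F @ np.array([u1[0], u1[1], 1.0])
    d = np.abs(np.c_[pix2, np.ones(len(pix2))] @ l) / np.hypot(l[0], l[1])
    w = dist2[:, j] * (d < eps)                     # agreement on the same point j
    if w.sum() < 1e-9:
        return None                                 # no epipolar-consistent pixel
    k = rng.choice(len(pix2), p=w / w.sum())
    x = triangulate(P1, P2, u1, pix2[k])            # 3D point in the world frame
    return surface_pts[j], x

def kabsch(src, dst):
    """Least-squares rigid transform (R, t) with dst ~ R @ src + t."""
    cs, cd = src.mean(0), dst.mean(0)
    U, _, Vt = np.linalg.svd((src - cs).T @ (dst - cd))
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    return R, cd - R @ cs
```

Stacking many sampled (model point, world point) pairs and passing them to `kabsch` yields a pose hypothesis; with more than two views, pairs of views can be sampled in the same way and their correspondences pooled before the fit.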