In many automation tasks involving manipulation of rigid objects, the poses of the objects must be acquired. Vision-based pose estimation using a single RGB or RGB-D sensor is especially popular due to its broad applicability. However, single-view pose estimation is inherently limited by depth ambiguity and by ambiguities caused by phenomena such as occlusion, self-occlusion, and reflection. Aggregating information from multiple views can potentially resolve these ambiguities, but the current state-of-the-art multi-view pose estimation method only uses multiple views to aggregate single-view pose estimates, and thus relies on obtaining good single-view estimates. We present a multi-view pose estimation method which aggregates learned 2D-3D distributions from multiple views, both for the initial estimate and for optional refinement. Our method performs probabilistic sampling of 3D-3D correspondences under epipolar constraints, using learned 2D-3D correspondence distributions which are implicitly trained to respect visual ambiguities such as symmetry. Evaluation on the T-LESS dataset shows that our method reduces pose estimation errors by 80-91% compared to the best single-view method, and we present state-of-the-art results on T-LESS with four views, even compared with methods using five and eight views.
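To make the multi-view idea concrete, the following is a minimal illustrative sketch (not the paper's implementation) of the triangulation step that underlies forming a 3D point from corresponding 2D observations in two calibrated views. The camera intrinsics `K`, the projection matrices `P1`/`P2`, and the pixel coordinates `x1`/`x2` are hypothetical placeholders; in the paper, the 2D points would instead be samples drawn from the learned 2D-3D correspondence distributions and filtered by epipolar consistency.

```python
# Illustrative sketch only: linear DLT triangulation of one 3D point from two
# calibrated views. All numeric values below are hypothetical placeholders.
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Triangulate one 3D point from two views via the linear DLT method.

    P1, P2: 3x4 camera projection matrices.
    x1, x2: (u, v) pixel coordinates of the same 3D point in each view.
    Returns the 3D point in world coordinates.
    """
    # Each observation contributes two linear constraints on the homogeneous point X.
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

# Hypothetical calibrated cameras: a reference view and a second view
# translated along the x-axis (stereo-like configuration).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.1], [0.0], [0.0]])])

# Hypothetical 2D detections of the same object point in both views.
x1 = np.array([350.0, 260.0])
x2 = np.array([330.0, 260.0])

print(triangulate_dlt(P1, P2, x1, x2))  # approx. [0.15, 0.10, 2.50]
```

In this toy configuration the two viewing rays intersect exactly; with noisy or ambiguous 2D samples, the DLT solution is a least-squares compromise, which is why aggregating many epipolar-consistent samples across views can suppress the ambiguities that a single view cannot resolve.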