State-of-the-art object pose estimation handles multiple instances in a test image with multi-model formulations: detection as a first stage, followed by separately trained per-object networks that predict 2D-3D geometric correspondences as a second stage. Poses are then estimated at runtime with the Perspective-n-Point (PnP) algorithm. Unfortunately, multi-model formulations are slow and scale poorly with the number of object instances. Recent approaches show that direct 6D object pose estimation is feasible when derived from these geometric correspondences. We present an approach that learns an intermediate geometric representation of multiple objects to directly regress the 6D poses of all instances in a test image. The inherent end-to-end trainability removes the need to process individual object instances separately. Pose hypotheses are clustered into distinct instances by calculating their mutual Intersection-over-Union (IoU), which incurs negligible runtime overhead with respect to the number of object instances. Results on multiple challenging standard datasets show that pose estimation performance is superior to single-model state-of-the-art approaches while being roughly 35 times faster. We additionally provide an analysis showing real-time applicability (>24 fps) for images containing more than 90 object instances. Further results show the advantage of supervising geometric-correspondence-based object pose estimation with the 6D pose.
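The IoU-based grouping of pose hypotheses can be illustrated with a minimal sketch. It is not the authors' implementation: the box format `(x_min, y_min, x_max, y_max)`, the names `iou` and `cluster_by_iou`, the per-hypothesis confidence scores, the greedy seeding strategy, and the threshold of 0.5 are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Assumed box format (not specified in the abstract): (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0.0 else 0.0


def cluster_by_iou(
    boxes: Sequence[Box],
    scores: Sequence[float],
    threshold: float = 0.5,  # hypothetical value, chosen only for this sketch
) -> List[List[int]]:
    """Greedily group hypothesis indices into instances.

    The highest-scoring unassigned hypothesis seeds a cluster; every other
    unassigned hypothesis whose box overlaps the seed by more than
    `threshold` joins that cluster.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    assigned = [False] * len(boxes)
    clusters: List[List[int]] = []
    for seed in order:
        if assigned[seed]:
            continue
        assigned[seed] = True
        members = [seed]
        for j in order:
            if not assigned[j] and iou(boxes[seed], boxes[j]) > threshold:
                assigned[j] = True
                members.append(j)
        clusters.append(members)
    return clusters


# Four hypotheses covering two objects.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60), (51, 50, 61, 60)]
scores = [0.9, 0.8, 0.7, 0.95]
print(cluster_by_iou(boxes, scores))  # [[3, 2], [0, 1]]
```

Each resulting cluster would correspond to one object instance, whose member hypotheses could then be merged into a single pose. This pure-Python loop is quadratic in the number of hypotheses and only shows the grouping logic; the negligible overhead reported above would presumably rely on a batched computation of the full IoU matrix.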