Object pose estimation is a key perceptual capability in robotics. We propose a fully-convolutional extension of the PoseCNN method that densely predicts object translations and orientations. This design has several advantages: improved spatial resolution of the orientation predictions, which is useful in highly cluttered arrangements; a significant reduction in parameters by avoiding fully-connected layers; and fast inference. We propose and discuss several aggregation methods, such as averaging and clustering techniques, that can be applied to the dense orientation predictions as a post-processing step. We demonstrate that our method matches the accuracy of PoseCNN on the challenging YCB-Video dataset and provide a detailed ablation study of several variants of our method. Finally, we show that the model can be further improved by inserting an iterative refinement module into the middle of the network, which enforces consistency of the predictions.
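As one concrete illustration of the averaging-based aggregation mentioned above, dense per-pixel orientation predictions (expressed as unit quaternions) can be fused into a single estimate. The sketch below uses Markley's eigenvector method, a standard way to average rotations; the function name, weighting scheme, and NumPy usage are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def average_quaternions(quats, weights=None):
    """Average unit quaternions via the eigenvector method (Markley et al.).

    quats   : (N, 4) array of unit quaternions (sign-ambiguous is fine,
              since the outer product q q^T is invariant to q -> -q).
    weights : optional (N,) per-prediction confidences (hypothetical here;
              e.g. per-pixel segmentation scores could serve this role).
    """
    q = np.asarray(quats, dtype=float)
    w = np.ones(len(q)) if weights is None else np.asarray(weights, dtype=float)
    # Accumulate the weighted outer-product matrix; its principal
    # eigenvector is the quaternion closest (in a least-squares sense)
    # to all input rotations.
    M = (w[:, None, None] * q[:, :, None] * q[:, None, :]).sum(axis=0)
    _, vecs = np.linalg.eigh(M)        # eigenvalues in ascending order
    avg = vecs[:, -1]                  # eigenvector of largest eigenvalue
    return avg / np.linalg.norm(avg)
```

A clustering-based alternative would instead group the dense predictions (e.g. by angular distance) and report the dominant cluster's mean, which is more robust when the predictions are multi-modal, as can happen for symmetric objects.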