We propose a novel method for joint estimation of the shape and pose of rigid objects from sequentially observed RGB-D images. In sharp contrast to past approaches that rely on complex non-linear optimization, we formulate this joint estimation as neural optimization that learns to estimate the shape and pose efficiently. We introduce Deep Directional Distance Function (DeepDDF), a neural network that directly outputs the depth image of an object given the camera viewpoint and viewing direction, which enables efficient error computation in 2D image space. We formulate the joint estimation itself as a Transformer, which we refer to as TransPoser. We fully leverage tokenization and multi-head attention to sequentially process the growing set of observations and to efficiently update the shape and pose with a learned momentum, respectively. Experimental results on synthetic and real data show that DeepDDF achieves high accuracy as a category-level object shape representation, and that TransPoser achieves state-of-the-art accuracy and efficiency in joint shape and pose estimation.
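To make the DeepDDF interface concrete, here is a minimal sketch in PyTorch. The class name aside, all tensor shapes, the MLP structure, and the latent shape-code conditioning are illustrative assumptions, not the authors' implementation; the point is only that the network maps a camera viewpoint and per-ray viewing directions directly to a depth image, so the error against an observed RGB-D frame reduces to a per-pixel comparison in 2D image space.

```python
# Hypothetical sketch of DeepDDF's interface (names and shapes are assumptions).
import torch
import torch.nn as nn

class DeepDDF(nn.Module):
    """Maps a camera viewpoint and per-ray viewing directions, conditioned on a
    latent shape code, to per-ray depths, i.e. a depth image of the object."""

    def __init__(self, latent_dim: int = 256, hidden: int = 256):
        super().__init__()
        # Per-ray input: viewpoint (3) + viewing direction (3) + shape code.
        self.mlp = nn.Sequential(
            nn.Linear(3 + 3 + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # predicted depth along each ray
        )

    def forward(self, viewpoint, directions, shape_code):
        # viewpoint: (B, 3), directions: (B, H*W, 3), shape_code: (B, latent_dim)
        n_rays = directions.shape[1]
        vp = viewpoint.unsqueeze(1).expand(-1, n_rays, -1)
        z = shape_code.unsqueeze(1).expand(-1, n_rays, -1)
        depth = self.mlp(torch.cat([vp, directions, z], dim=-1))
        return depth.squeeze(-1)  # (B, H*W); reshape to (B, H, W) as needed

# The 2D-image-space error is then a direct per-pixel residual, e.g.:
#   loss = (model(viewpoint, directions, z) - observed_depth).abs().mean()
```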
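Similarly, a minimal sketch of a TransPoser-style refinement step, under the same caveat: the token layout, the 7-D pose parameterization (translation + quaternion), and the residual-update head are assumptions for illustration only. It shows the mechanism the abstract describes: observations are tokenized so the set can grow frame by frame, multi-head attention mixes them with the current estimate, and the network emits an update to shape and pose, playing the role of a learned optimization step.

```python
# Hypothetical sketch of a TransPoser-style update (structure is an assumption).
import torch
import torch.nn as nn

class TransPoser(nn.Module):
    def __init__(self, obs_dim: int = 128, d_model: int = 256,
                 nhead: int = 8, num_layers: int = 4, latent_dim: int = 256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.obs_embed = nn.Linear(obs_dim, d_model)            # one token per observation
        self.state_embed = nn.Linear(7 + latent_dim, d_model)   # pose (7) + shape code
        self.head = nn.Linear(d_model, 7 + latent_dim)          # residual update

    def forward(self, obs_tokens, pose, shape_code):
        # obs_tokens: (B, T, obs_dim) -- T grows as new RGB-D frames arrive.
        state = self.state_embed(torch.cat([pose, shape_code], dim=-1)).unsqueeze(1)
        out = self.encoder(torch.cat([state, self.obs_embed(obs_tokens)], dim=1))
        delta = self.head(out[:, 0])  # read the update off the state token
        return pose + delta[:, :7], shape_code + delta[:, 7:]
```

Iterating this forward pass as frames arrive replaces the inner loop of a classical non-linear optimizer; appending one token per new observation is what lets the same network handle a growing sequence.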