Estimating 3D shapes and poses of static objects from a single image has important applications for robotics, augmented reality and digital content creation. Often this is done through direct mesh prediction, which produces unrealistic, overly tessellated shapes, or by formulating shape prediction as a retrieval task followed by CAD model alignment. Directly predicting CAD model poses from 2D image features is difficult and inaccurate. Some works, such as ROCA, regress normalised object coordinates and use those to compute poses. While this can produce more accurate pose estimates, predicting normalised object coordinates is susceptible to systematic failure. Leveraging efficient transformer architectures, we demonstrate that a sparse, iterative, render-and-compare approach is more accurate and robust than relying on normalised object coordinates. For this we combine 2D image information, including sparse depth and surface normal values that we estimate directly from the image, with 3D CAD model information in early fusion. In particular, we reproject points sampled from the CAD model in an initial, random pose and compute their depth and surface normal values. This combined information is the input to a pose prediction network, SPARC-Net, which we train to predict a 9 DoF CAD model pose update. The CAD model is then reprojected and the next pose update is predicted. Our alignment procedure converges after just 3 iterations, improving the state-of-the-art performance on the challenging real-world dataset ScanNet from 25.0% to 31.8% instance alignment accuracy. Code will be released at https://github.com/florianlanger/SPARC .
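The iterative alignment described above can be sketched as follows. This is a minimal, dependency-free Python illustration and not the released SPARC implementation: `predict_update` is a hypothetical stand-in for SPARC-Net, the pinhole camera parameters `focal` and `centre` are assumptions, and the way each update is applied (rotation composed on the left, translation added, scale multiplied) is assumed, since the abstract does not specify it.

```python
import math

def mat_vec(R, v):
    """Multiply a 3x3 matrix (nested lists) by a 3-vector."""
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]

def mat_mul(A, B):
    """Multiply two 3x3 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def reproject(points, normals, pose, focal, centre):
    """Project sampled CAD points under a 9 DoF pose (R, t, s).

    Returns one tuple (u, v, depth, nx, ny, nz) per point. A simple
    pinhole camera is assumed; points must lie in front of the camera.
    """
    R, t, s = pose
    samples = []
    for p, n in zip(points, normals):
        scaled = [pi * si for pi, si in zip(p, s)]
        cam = [c + o for c, o in zip(mat_vec(R, scaled), t)]
        # Normals transform with the inverse scale, then are renormalised.
        nrm = mat_vec(R, [ni / si for ni, si in zip(n, s)])
        length = math.sqrt(sum(c * c for c in nrm))
        nrm = [c / length for c in nrm]
        u = focal * cam[0] / cam[2] + centre[0]
        v = focal * cam[1] / cam[2] + centre[1]
        samples.append((u, v, cam[2], *nrm))
    return samples

def refine_pose(points, normals, image_samples, predict_update,
                pose, focal, centre, n_iters=3):
    """Iteratively refine a pose: reproject, predict an update, apply it.

    `predict_update` is a hypothetical callable taking the image samples and
    the reprojected CAD samples (early fusion of both inputs) and returning
    (delta_R, delta_t, delta_s). The update rule below is an assumption.
    """
    for _ in range(n_iters):
        cad_samples = reproject(points, normals, pose, focal, centre)
        d_R, d_t, d_s = predict_update(image_samples, cad_samples)
        R, t, s = pose
        pose = (mat_mul(d_R, R),
                [a + b for a, b in zip(t, d_t)],
                [a * b for a, b in zip(s, d_s)])
    return pose
```

In this sketch, any callable with the same signature can be passed as `predict_update`, so a trained network or a hand-written heuristic can be swapped in without changing the refinement loop. The default of three iterations mirrors the convergence behaviour reported in the abstract.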