3D pose estimation is a challenging but important task in computer vision. In this work, we show that standard deep learning approaches to 3D pose estimation are not robust when objects are partially occluded or viewed from a previously unseen pose. Inspired by the robustness of generative vision models to partial occlusion, we propose to integrate deep neural networks with 3D generative representations of objects into a unified neural architecture that we term NeMo. In particular, NeMo learns a generative model of neural feature activations at each vertex on a dense 3D mesh. Using differentiable rendering, we estimate the 3D object pose by minimizing the reconstruction error between NeMo and the feature representation of the target image. To avoid local optima in the reconstruction loss, we train the feature extractor to maximize the distance between the individual feature representations on the mesh using contrastive learning. Our extensive experiments on PASCAL3D+, occluded-PASCAL3D+, and ObjectNet3D show that NeMo is much more robust to partial occlusion and unseen poses than standard deep networks, while retaining competitive performance on regular data. Interestingly, our experiments also show that NeMo performs reasonably well even when the mesh representation only crudely approximates the true object geometry with a cuboid, revealing that detailed 3D geometry is not needed for accurate 3D pose estimation. The code is publicly available at https://github.com/Angtian/NeMo.
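The render-and-compare step the abstract describes can be made concrete with a short sketch. Everything below is illustrative rather than the actual NeMo implementation: the function names (`rodrigues`, `reconstruction_loss`, `estimate_pose`), the point-based projection, and the single Adam loop are assumptions of this sketch, whereas the paper rasterizes a dense mesh with a differentiable renderer and restarts the optimization from many pre-defined initial poses. The sketch only shows the core idea: minimize a feature-reconstruction loss over pose parameters by gradient descent.

```python
# Minimal sketch of NeMo-style render-and-compare pose estimation.
# Hypothetical names and a simplified point projection, NOT the paper's code.
import torch
import torch.nn.functional as F

def skew(k):
    """3x3 skew-symmetric matrix of a 3-vector, built with autograd-safe ops."""
    z = torch.zeros((), dtype=k.dtype)
    return torch.stack([
        torch.stack([z, -k[2], k[1]]),
        torch.stack([k[2], z, -k[0]]),
        torch.stack([-k[1], k[0], z]),
    ])

def rodrigues(rvec):
    """Axis-angle vector -> rotation matrix via Rodrigues' formula."""
    theta = torch.sqrt((rvec ** 2).sum() + 1e-8)  # eps keeps the gradient finite near 0
    K = skew(rvec / theta)
    return torch.eye(3) + torch.sin(theta) * K + (1 - torch.cos(theta)) * (K @ K)

def reconstruction_loss(rvec, tvec, verts, vert_feats, feat_map, focal=300.0):
    """Negative mean cosine similarity between each vertex's stored feature
    and the target feature map sampled at the vertex's projected location."""
    cam = verts @ rodrigues(rvec).T + tvec          # (V, 3) points in camera frame
    uv = focal * cam[:, :2] / cam[:, 2:3]           # (V, 2) pixel offsets from image center
    H, W = feat_map.shape[-2:]
    grid = torch.stack([uv[:, 0] / (W / 2), uv[:, 1] / (H / 2)], dim=-1)  # -> [-1, 1]
    sampled = F.grid_sample(feat_map[None], grid[None, None], align_corners=False)
    sampled = sampled[0, :, 0].T                    # (V, C) image features at projections
    return -F.cosine_similarity(sampled, vert_feats, dim=-1).mean()

def estimate_pose(verts, vert_feats, feat_map, steps=300, lr=5e-2):
    """Gradient-based pose search from a single initialization; the paper
    instead keeps the lowest-loss result over many initial poses."""
    rvec = (0.01 * torch.randn(3)).requires_grad_()            # rotation (axis-angle)
    tvec = torch.tensor([0.0, 0.0, 5.0], requires_grad=True)   # translation
    opt = torch.optim.Adam([rvec, tvec], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        reconstruction_loss(rvec, tvec, verts, vert_feats, feat_map).backward()
        opt.step()
    return rvec.detach(), tvec.detach()

# Example call with dummy tensors standing in for a mesh and CNN features:
# rvec, tvec = estimate_pose(torch.randn(500, 3), torch.randn(500, 128),
#                            torch.randn(128, 32, 32))
```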
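The contrastive training of the feature extractor can likewise be sketched. The InfoNCE-style loss below is a hedged stand-in that captures the stated goal of pushing apart the feature representations of different vertices; `vertex_contrastive_loss` and its inputs are hypothetical names, and the exact objective in the paper (defined with respect to the generative vertex model) differs in form.

```python
# Hedged stand-in for the contrastive feature-training objective.
import torch
import torch.nn.functional as F

def vertex_contrastive_loss(pix_feats, vert_feats, tau=0.07):
    """InfoNCE-style loss: the image feature sampled at vertex i's (known,
    training-time) projection should match vertex i's stored feature and
    differ from every other vertex's feature.
    pix_feats:  (V, C) image features at each visible vertex's projection.
    vert_feats: (V, C) the learned per-vertex feature bank."""
    pix = F.normalize(pix_feats, dim=-1)
    bank = F.normalize(vert_feats, dim=-1)
    logits = pix @ bank.T / tau             # (V, V) all pixel-vertex similarities
    labels = torch.arange(pix.shape[0])     # the diagonal pairs are the positives
    return F.cross_entropy(logits, labels)
```

Separating the per-vertex features in this way is what the abstract credits with smoothing the reconstruction loss, so the gradient-based pose search is less likely to stall in local optima.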