Pose estimation is usually tackled as either a bin classification problem or a regression problem. In both cases, the idea is to directly predict the pose of an object. This is a non-trivial task because similar poses can differ in appearance, while different poses can look alike. Instead, we follow the key idea that it is easier to compare two poses than to estimate them. Render-and-compare approaches have been employed to that end; however, they tend to be unstable, computationally expensive, and too slow for real-time applications. We propose performing category-level pose estimation by learning an alignment metric with a contrastive loss that uses a dynamic margin and a continuous pose-label space. For efficient inference, we use a simple real-time image retrieval scheme in which a reference set of renderings is projected into an embedding space. To achieve robustness to real-world conditions, we employ synthetic occlusions, bounding box perturbations, and appearance augmentations. Our approach achieves state-of-the-art performance on PASCAL3D and OccludedPASCAL3D, as well as high-quality results on KITTI3D.
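To make the two main ingredients concrete, the following is a minimal sketch of a pairwise contrastive loss with a pose-dependent margin and of nearest-neighbour retrieval against a reference set of embeddings. It is not the authors' implementation: the abstract does not give the loss formula, so the linear margin, the `pos_threshold` and `margin_scale` parameters, the restriction of pose to a single azimuth angle, and all function names are assumptions chosen for illustration.

```python
import math


def angular_distance(a_deg, b_deg):
    """Smallest absolute difference between two azimuth angles, in degrees."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def dynamic_margin_contrastive_loss(emb_a, emb_b, pose_a, pose_b,
                                    pos_threshold=5.0,
                                    margin_scale=1.0 / 180.0):
    """Hypothetical pairwise loss with a margin that grows with pose difference.

    Pairs whose poses differ by at most `pos_threshold` degrees are pulled
    together. All other pairs are pushed apart until their embedding distance
    reaches a margin proportional to their pose difference (an assumed form).
    """
    d_emb = euclidean(emb_a, emb_b)
    d_pose = angular_distance(pose_a, pose_b)
    if d_pose <= pos_threshold:
        return d_emb ** 2
    margin = margin_scale * d_pose
    return max(0.0, margin - d_emb) ** 2


def retrieve_pose(query_emb, reference_embs, reference_poses):
    """Return the pose label of the reference embedding closest to the query."""
    best = min(range(len(reference_embs)),
               key=lambda i: euclidean(query_emb, reference_embs[i]))
    return reference_poses[best]


# Toy reference set: four "renderings" embedded on the unit circle.
reference_embs = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
reference_poses = [0.0, 90.0, 180.0, 270.0]
print(retrieve_pose((0.1, 0.9), reference_embs, reference_poses))  # 90.0
```

Because the reference embeddings can be computed once offline, inference in a scheme like this reduces to one forward pass for the query plus a nearest-neighbour search, which is what makes retrieval attractive compared with iterative render-and-compare. A brute-force search is used here for clarity; a real system would likely use a vectorised or approximate index.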