This paper studies category-level object pose estimation from a single monocular image. Recent advances in pose-aware generative models have paved the way for addressing this challenging task with analysis-by-synthesis. The idea is to sequentially update a set of latent variables of the generative model, e.g., pose, shape, and appearance, until the generated image best agrees with the observation. However, convergence and efficiency remain two challenges of this inference procedure. In this paper, we take a deeper look at analysis-by-synthesis inference from the perspective of visual navigation and investigate what constitutes a good navigation policy for this specific task. We evaluate three strategies, namely gradient descent, reinforcement learning, and imitation learning, through thorough comparisons in terms of convergence, robustness, and efficiency. Moreover, we show that a simple hybrid approach leads to an effective and efficient solution. We further compare these strategies to state-of-the-art methods and demonstrate superior performance on synthetic and real-world datasets, leveraging off-the-shelf pose-aware generative models.
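To make the analysis-by-synthesis loop concrete, the following is a minimal sketch of the gradient-descent strategy only. It does not reproduce the paper's models or policies: `render` is a hypothetical toy stand-in for a pose-aware generative model that maps a two-dimensional latent (pose, shape) to a 1D "image", and the finite-difference gradient, learning rate, and step count are illustrative assumptions (a real generator would typically be differentiated with automatic differentiation).

```python
import numpy as np

# Pixel grid of the toy 1D "image" (an assumption for illustration).
GRID = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)


def render(z):
    """Hypothetical toy generator: latent z = (pose, shape) -> 1D image."""
    pose, shape = z
    return shape * np.cos(GRID - pose)


def loss(z, observation):
    """Mean squared difference between the generated image and the observation."""
    return float(np.mean((render(z) - observation) ** 2))


def numerical_grad(z, observation, eps=1e-5):
    """Central finite-difference gradient of the loss w.r.t. the latents."""
    grad = np.zeros_like(z)
    for i in range(z.size):
        step = np.zeros_like(z)
        step[i] = eps
        grad[i] = (loss(z + step, observation)
                   - loss(z - step, observation)) / (2.0 * eps)
    return grad


def infer_latents(observation, z_init, lr=0.1, n_steps=500):
    """Update the latents step by step until the rendering matches the observation."""
    z = np.array(z_init, dtype=float)
    for _ in range(n_steps):
        z -= lr * numerical_grad(z, observation)
    return z


# Synthesize an observation from known latents, then recover them.
z_true = np.array([0.8, 1.5])
observation = render(z_true)
z_hat = infer_latents(observation, z_init=[0.0, 1.0])
print("estimated (pose, shape):", z_hat)
print("final loss:", loss(z_hat, observation))
```

In this toy setting the result depends on the initialization: if the initial pose is off by exactly half a turn, the gradient with respect to pose vanishes and plain gradient descent stalls. This is a small-scale illustration of the convergence issue that motivates comparing gradient descent against learned navigation policies.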