We introduce ViewNeRF, a Neural Radiance Field-based viewpoint estimation method that learns to predict category-level viewpoints directly from images during training. While NeRF is usually trained with ground-truth camera poses, multiple extensions have been proposed to reduce the need for this expensive supervision. Nonetheless, most of these methods still struggle in complex settings with large camera movements, and are restricted to single scenes, i.e., they cannot be trained on a collection of scenes depicting the same object category. To address these issues, our method uses an analysis-by-synthesis approach, combining a conditional NeRF with a viewpoint predictor and a scene encoder in order to produce self-supervised reconstructions for whole object categories. Rather than focusing on high-fidelity reconstruction, we target efficient and accurate viewpoint prediction in complex scenarios, e.g., 360° rotation on real data. Our model shows competitive results on synthetic and real datasets, both for single scenes and multi-instance collections.
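To make the analysis-by-synthesis loop described above concrete, here is a minimal PyTorch sketch of the pipeline: a scene encoder produces a latent code, a viewpoint predictor regresses a camera pose from the same image, a conditional NeRF renders the scene from that pose, and a photometric loss between the rendering and the input provides the only supervision. All module names (SceneEncoder, ViewpointPredictor, ConditionalNeRF), network sizes, the azimuth/elevation camera parameterization, and the rendering details are simplifying assumptions for illustration, not the paper's implementation.

```python
# Hypothetical, simplified sketch of a self-supervised viewpoint-estimation loop;
# module names, sizes, and the camera model are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SceneEncoder(nn.Module):
    """Maps an image to a per-scene latent code that conditions the NeRF."""
    def __init__(self, z_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, z_dim))

    def forward(self, img):
        return self.net(img)

class ViewpointPredictor(nn.Module):
    """Regresses (cos a, sin a, elevation) so azimuth covers the full 360° range."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 3))

    def forward(self, img):
        out = self.net(img)
        azim = torch.atan2(out[:, 1], out[:, 0])   # full-circle azimuth
        elev = 0.5 * torch.tanh(out[:, 2])         # bounded elevation (radians)
        return azim, elev

class ConditionalNeRF(nn.Module):
    """Small MLP mapping (3D point, latent code) to density and RGB."""
    def __init__(self, z_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3 + z_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 128), nn.ReLU(),
                                 nn.Linear(128, 4))

    def forward(self, pts, z):
        # Broadcast the latent code to every sampled point, then decode.
        z = z.view(z.shape[0], *([1] * (pts.dim() - 2)), -1).expand(*pts.shape[:-1], -1)
        out = self.net(torch.cat([pts, z], -1))
        return F.relu(out[..., 0]), torch.sigmoid(out[..., 1:])  # density, color

def camera_rays(azim, elev, n=16, radius=2.0, focal=1.2):
    """Place a camera on a sphere looking at the origin; return ray origins/dirs."""
    cam = torch.stack([radius * torch.cos(elev) * torch.cos(azim),
                       radius * torch.cos(elev) * torch.sin(azim),
                       radius * torch.sin(elev)], -1)                    # (B, 3)
    fwd = F.normalize(-cam, dim=-1)
    world_up = torch.tensor([0., 0., 1.]).expand_as(fwd)
    right = F.normalize(torch.cross(fwd, world_up, dim=-1), dim=-1)
    up = torch.cross(right, fwd, dim=-1)
    i, j = torch.meshgrid(torch.linspace(-.5, .5, n),
                          torch.linspace(-.5, .5, n), indexing='xy')
    dirs = (i[..., None] * right[:, None, None] - j[..., None] * up[:, None, None]
            + focal * fwd[:, None, None])                                # (B, n, n, 3)
    return cam[:, None, None].expand_as(dirs), F.normalize(dirs, dim=-1)

def render(nerf, z, origins, dirs, n_samples=32, near=1.0, far=3.0):
    """Uniform sampling along each ray followed by standard alpha compositing."""
    t = torch.linspace(near, far, n_samples)
    pts = origins[..., None, :] + dirs[..., None, :] * t[:, None]        # (B,n,n,S,3)
    sigma, rgb = nerf(pts, z)
    alpha = 1 - torch.exp(-sigma * (far - near) / n_samples)             # (B,n,n,S)
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[..., :1]),
                                     1 - alpha + 1e-10], -1), -1)[..., :-1]
    return ((alpha * trans)[..., None] * rgb).sum(-2)                    # (B,n,n,3)

# One self-supervised training step on a stand-in batch: encode, predict the
# viewpoint, render from that viewpoint, and penalize the photometric error.
encoder, viewnet, nerf = SceneEncoder(), ViewpointPredictor(), ConditionalNeRF()
opt = torch.optim.Adam([*encoder.parameters(), *viewnet.parameters(),
                        *nerf.parameters()], lr=1e-4)

img = torch.rand(2, 3, 16, 16)        # placeholder images; real data would go here
z = encoder(img)
azim, elev = viewnet(img)
origins, dirs = camera_rays(azim, elev, n=img.shape[-1])
recon = render(nerf, z, origins, dirs)
loss = F.mse_loss(recon.permute(0, 3, 1, 2), img)
opt.zero_grad(); loss.backward(); opt.step()
```

Because the photometric reconstruction loss is the only training signal, gradients flow through the differentiable renderer into both the viewpoint predictor and the scene encoder; this is what allows the pose to be learned without ground-truth cameras, and conditioning the NeRF on a per-image latent code is what lets a single model cover a whole object category rather than a single scene.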