We propose a technique for learning single-view 3D object pose estimation models by utilizing a new source of data -- in-the-wild videos where objects turn. Such videos are prevalent in practice (e.g., cars in roundabouts, airplanes near runways) and easy to collect. We show that classical structure-from-motion algorithms, coupled with recent advances in instance detection and feature matching, provide surprisingly accurate relative 3D pose estimates on such videos. We propose a multi-stage training scheme that first learns a canonical pose across a collection of videos and then supervises a model for single-view pose estimation. The proposed technique achieves performance competitive with the existing state of the art on standard benchmarks for 3D pose estimation, without requiring any pose labels during training. We also contribute an Accidental Turntables Dataset, containing a challenging set of 41,212 images of cars with cluttered backgrounds, motion blur, and illumination changes, which serves as a benchmark for 3D pose estimation.
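As a rough illustration of the two-view geometry that classical structure-from-motion builds on, the sketch below (not the authors' pipeline) uses OpenCV to match SIFT features between two hypothetical frames of a turning object and recovers their relative rotation via the essential matrix. The frame paths and camera intrinsics K are placeholder assumptions.

```python
# Minimal two-view relative pose sketch, assuming OpenCV (cv2) is installed.
# Frame filenames and the intrinsics matrix K below are hypothetical.
import cv2
import numpy as np

K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])  # placeholder camera intrinsics

img1 = cv2.imread("frame_000.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder frames
img2 = cv2.imread("frame_010.jpg", cv2.IMREAD_GRAYSCALE)

# Detect and match local features between the two frames.
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)
matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2)

# Keep matches that pass Lowe's ratio test.
good = [pair[0] for pair in matches
        if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance]
pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

# Estimate the essential matrix with RANSAC and decompose it into a
# relative rotation R and unit-norm translation t between the two views.
E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)

angle = np.degrees(np.arccos((np.trace(R) - 1.0) / 2.0))  # rotation angle in degrees
print(f"Relative rotation between frames: {angle:.1f} degrees")
```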