We present a passive stereo depth system that produces dense, accurate point clouds optimized for human environments, including dark, textureless, thin, reflective, and specular surfaces and objects, at 2560x2048 resolution with 384 disparities in 30 ms. The system consists of an algorithm that combines learned stereo matching with engineered filtering, a training and data-mixing methodology, and a sensor hardware design. Our architecture is 15x faster than approaches that perform comparably on the Middlebury and FlyingThings stereo benchmarks. To effectively supervise the training of this model, we combine real data labeled using off-the-shelf depth sensors with a number of rendered, simulated labeled datasets. We demonstrate the efficacy of our system with a large number of qualitative results in the form of depth maps and point clouds, with experiments validating the metric accuracy of our system, and with comparisons to other sensors on challenging objects and scenes. We also show the competitiveness of our algorithm against state-of-the-art learned models on the Middlebury and FlyingThings datasets.