We present the design of a productionized end-to-end stereo depth sensing system that performs pre-processing, online stereo rectification, and stereo depth estimation, with a fallback to monocular depth estimation when rectification is unreliable. The output of our depth sensing system is then used in a novel view generation pipeline to create 3D computational photography effects from point-of-view images captured by smart glasses. All of these steps are executed on-device within the stringent compute budget of a mobile phone, and because users may own a wide range of smartphones, our design must be general and cannot depend on particular hardware or an ML accelerator such as a smartphone GPU. Although each of these steps is well studied, a description of a practical end-to-end system is still lacking. In such a system, all steps need to work in tandem with one another and fall back gracefully on failures within the system or on less-than-ideal input data. We show how we handle unforeseen changes to calibration, e.g., due to heat, robustly support depth estimation in the wild, and still abide by the memory and latency constraints required for a smooth user experience. We show that our trained models are fast, running in less than 1 s on the CPU of a six-year-old Samsung Galaxy S8. Our models generalize well to unseen data and achieve good results on Middlebury and on in-the-wild images captured by the smart glasses.