Depth estimation enables a wide variety of 3D applications, such as robotics, autonomous driving, and virtual reality. Despite significant work in this area, achieving accurate, low-cost, high-resolution, and long-range depth estimation remains an open problem. Inspired by the flash-to-bang phenomenon (i.e., hearing the thunder after seeing the lightning), this paper develops FBDepth, the first audio-visual depth estimation framework. It uses the difference between the time-of-flight (ToF) of light and that of sound to infer the depth of a sound source. FBDepth is the first to combine video and audio with both semantic features and spatial hints for range estimation. It first aligns the video track with the audio track to localize the target object and the target sound at a coarse granularity. Based on observations of moving objects' trajectories, FBDepth estimates the intersection of the optical flow before and after the sound production to localize the video event in time. FBDepth then feeds the estimated timestamp of the video event, together with the audio clip, into the final depth estimation. We use a mobile phone to collect 3000+ video clips of 20 different objects at distances up to $50m$. FBDepth decreases the Absolute Relative error (AbsRel) by 55\% compared to RGB-based methods.
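The flash-to-bang principle behind FBDepth can be sketched numerically: light arrives almost instantly, so the ToF gap between seeing an event and hearing it maps to depth nearly linearly in the speed of sound. A minimal illustration under assumed constants (this is only the physical relation, not FBDepth's learned alignment pipeline):

```python
# Flash-to-bang depth from the ToF gap between light and sound.
# Constants are assumed: speed of sound in air at ~20 C, speed of light in vacuum.
SPEED_OF_SOUND = 343.0   # m/s
SPEED_OF_LIGHT = 3.0e8   # m/s

def depth_from_tof_gap(delta_t: float) -> float:
    """Depth (m) of a sound source from the time gap delta_t (s)
    between the visual event and the arrival of its sound.

    Exact relation: delta_t = d / v_sound - d / v_light,
    so d = delta_t * v_sound * v_light / (v_light - v_sound).
    """
    return delta_t * SPEED_OF_SOUND * SPEED_OF_LIGHT / (SPEED_OF_LIGHT - SPEED_OF_SOUND)

# A 100 ms gap corresponds to roughly 34.3 m; since v_light >> v_sound,
# d ~= delta_t * v_sound is an excellent approximation at these ranges.
```

Because the light term is negligible at tens of meters, the depth error is dominated by how precisely the video event and the sound onset can be timestamped, which is why the paper's optical-flow-based event localization matters.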