We address the problem of estimating depth from multimodal audio-visual data. Inspired by the ability of animals, such as bats and dolphins, to infer the distance of objects through echolocation, some recent methods have utilized echoes for depth estimation. We propose an end-to-end deep-learning-based pipeline utilizing RGB images, binaural echoes, and estimated material properties of the various objects within a scene. We argue that the relation between image, echoes, and depth for different scene elements is greatly influenced by the properties of those elements, and that a method designed to leverage this information can lead to significantly improved depth estimation from audio-visual inputs. We propose a novel multimodal fusion technique, which incorporates the material properties explicitly while combining the audio (echoes) and visual modalities to predict the scene depth. We show empirically, with experiments on the Replica dataset, that the proposed method obtains a 28% improvement in RMSE compared to the state-of-the-art audio-visual depth prediction method. To demonstrate the effectiveness of our method on a larger dataset, we report competitive performance on Matterport3D, proposing to use it as a multimodal depth prediction benchmark with echoes for the first time. We also analyse the proposed method with exhaustive ablation experiments and qualitative results. The code and models are available at https://krantiparida.github.io/projects/bimgdepth.html
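To make the fusion idea concrete, below is a minimal illustrative sketch of material-aware multimodal fusion: per-modality features (RGB, echo, material) are weighted by learned attention scores before being passed to a depth decoder. This is an assumption for illustration only, not the paper's actual architecture; all module names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

# Illustrative sketch (NOT the paper's architecture): fuse visual, echo,
# and material-property features with learned attention weights.
class MaterialAwareFusion(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        # Attention scores over the three modalities, conditioned on all of them
        self.attn = nn.Sequential(
            nn.Linear(feat_dim * 3, 3),
            nn.Softmax(dim=-1),
        )
        self.proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, rgb_feat, echo_feat, mat_feat):
        # rgb_feat, echo_feat, mat_feat: (B, feat_dim) global feature vectors
        scores = self.attn(torch.cat([rgb_feat, echo_feat, mat_feat], dim=-1))
        fused = (scores[:, 0:1] * rgb_feat
                 + scores[:, 1:2] * echo_feat
                 + scores[:, 2:3] * mat_feat)
        return self.proj(fused)  # fused feature would feed a depth decoder

if __name__ == "__main__":
    fusion = MaterialAwareFusion()
    rgb, echo, mat = (torch.randn(2, 512) for _ in range(3))
    print(fusion(rgb, echo, mat).shape)  # torch.Size([2, 512])
```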