We propose a new framework for extracting visual information about a scene using only audio signals. Audio-based methods can overcome some of the limitations of vision-based methods: they do not require "line-of-sight", are robust to occlusions and changes in illumination, and can function as a backup in case vision/lidar sensors fail. Therefore, audio-based methods can be useful even for applications in which only visual information is of interest. Our framework is based on manifold learning and consists of two steps. First, we train a Vector-Quantized Variational Auto-Encoder (VQ-VAE) to learn the data manifold of the particular visual modality we are interested in. Second, we train an Audio Transformation network to map multi-channel audio signals to the latent representation of the corresponding visual sample. We show that our method is able to produce meaningful images from audio using a publicly available audio/visual dataset. In particular, we consider the prediction of two visual modalities from audio: depth and semantic segmentation. We hope the findings of our work can facilitate further research on visual information extraction from audio. Code is available at: https://github.com/ubc-vision/audio_manifold.
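Since the abstract summarizes a two-step architecture, a minimal sketch may help make the pattern concrete. Everything below is an illustrative assumption rather than the authors' implementation (the linked repository is authoritative): the module names (`VectorQuantizer`, `VQVAE`, `AudioToLatent`), channel counts, latent grid size, and losses are placeholders. The sketch only shows the general scheme the abstract describes: a VQ-VAE with a straight-through quantizer learns latents for a visual modality, and an audio network is then regressed onto those latents.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient."""

    def __init__(self, num_codes=512, code_dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z):  # z: (B, C, H, W) continuous encoder output
        b, c, h, w = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, c)          # (B*H*W, C)
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=1)
        q = self.codebook(idx).view(b, h, w, c).permute(0, 3, 1, 2)
        # Codebook + commitment losses (beta = 0.25, as in the VQ-VAE paper).
        vq_loss = F.mse_loss(q, z.detach()) + 0.25 * F.mse_loss(q.detach(), z)
        # Straight-through estimator: gradients flow from q back to z.
        return z + (q - z).detach(), vq_loss


class VQVAE(nn.Module):
    """Step 1: learn the manifold of one visual modality (e.g. depth maps)."""

    def __init__(self, in_ch=1, code_dim=64):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(in_ch, 128, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(128, code_dim, 4, 2, 1))               # 4x downsampling
        self.vq = VectorQuantizer(code_dim=code_dim)
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(code_dim, 128, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(128, in_ch, 4, 2, 1))

    def forward(self, x):
        q, vq_loss = self.vq(self.enc(x))
        return self.dec(q), vq_loss


class AudioToLatent(nn.Module):
    """Step 2: map multi-channel audio (e.g. per-microphone spectrograms
    stacked as channels) to the frozen VQ-VAE's latent grid."""

    def __init__(self, audio_ch=8, code_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(audio_ch, 128, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(128, code_dim, 4, 2, 1),
            nn.AdaptiveAvgPool2d(16))           # match the 16x16 latent grid

    def forward(self, a):
        return self.net(a)


# Step 1: train the VQ-VAE on the visual modality alone.
# Step 2: freeze it and regress audio-predicted latents onto its encoder output.
vqvae, a2l = VQVAE(), AudioToLatent()
depth = torch.rand(2, 1, 64, 64)    # dummy depth maps
audio = torch.rand(2, 8, 64, 64)    # dummy 8-channel audio spectrograms
recon, vq_loss = vqvae(depth)
step1_loss = F.mse_loss(recon, depth) + vq_loss
with torch.no_grad():
    target = vqvae.enc(depth)       # latents of the paired visual sample
step2_loss = F.mse_loss(a2l(audio), target)
pred_depth = vqvae.dec(vqvae.vq(a2l(audio))[0])  # inference: audio -> image
```

One plausible reading of the design, under these assumptions: because the quantizer snaps the audio-predicted latents onto the learned codebook before decoding, the decoder is constrained to produce outputs that lie on the visual manifold even when the audio-to-latent regression is noisy.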