Sensory substitution can help persons with perceptual deficits. In this work, we attempt to visualize audio with video. Our long-term goal is to create sound perception for hearing-impaired people, for instance, to facilitate feedback for training deaf speech. Unlike existing models that translate between speech and text or text and images, we target an immediate, low-level translation that applies to generic environmental sounds and human speech without delay. No canonical mapping is known for this artificial translation task. Our design translates from audio to video by compressing both into a common latent space with shared structure. Our core contribution is the development and evaluation of learned mappings that respect human perception limits and maximize user comfort by enforcing priors and combining strategies from unpaired image translation and disentanglement. We demonstrate qualitatively and quantitatively that our AudioViewer model maintains important audio features in the generated video and that generated videos of faces and numbers are well suited for visualizing high-dimensional audio features, since humans can easily parse them to match and distinguish between sounds, words, and speakers.
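The abstract's central design, an audio encoder and a video decoder coupled through a common latent space, can be illustrated with a minimal sketch. The code below is not the authors' implementation: the layer sizes, input dimensions, and the simple cycle-consistency term used to couple the two latents are illustrative assumptions.

```python
# Minimal sketch (not the paper's architecture) of translating audio to video
# through a shared latent space. All dimensions and the cycle loss are
# assumptions for illustration only.
import torch
import torch.nn as nn

LATENT_DIM = 64          # assumed size of the common latent space
AUDIO_DIM = 80 * 16      # assumed: 16 frames of an 80-bin mel spectrogram, flattened
IMAGE_DIM = 3 * 64 * 64  # assumed: one 64x64 RGB video frame, flattened


def mlp(in_dim, out_dim, hidden=512):
    """Small fully connected block used for every encoder/decoder."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )


class AudioViewerSketch(nn.Module):
    """Audio and video autoencoders that meet in one latent space."""

    def __init__(self):
        super().__init__()
        self.audio_enc = mlp(AUDIO_DIM, LATENT_DIM)
        self.audio_dec = mlp(LATENT_DIM, AUDIO_DIM)
        self.video_enc = mlp(IMAGE_DIM, LATENT_DIM)
        self.video_dec = mlp(LATENT_DIM, IMAGE_DIM)

    def sonify_to_frame(self, audio):
        """Audio -> shared latent -> video frame (the substitution path)."""
        z = self.audio_enc(audio)
        return self.video_dec(z)

    def losses(self, audio, image):
        """Reconstruction on each modality plus a cycle term that keeps the
        audio latent recoverable from the generated frame; this is one
        possible way to give the two latents shared structure."""
        z_a = self.audio_enc(audio)
        z_v = self.video_enc(image)
        rec_a = nn.functional.mse_loss(self.audio_dec(z_a), audio)
        rec_v = nn.functional.mse_loss(self.video_dec(z_v), image)
        frame = self.video_dec(z_a)
        cycle = nn.functional.mse_loss(self.video_enc(frame), z_a)
        return rec_a + rec_v + cycle


if __name__ == "__main__":
    model = AudioViewerSketch()
    audio = torch.randn(4, AUDIO_DIM)   # dummy batch of audio features
    image = torch.randn(4, IMAGE_DIM)   # dummy batch of video frames
    print(model.losses(audio, image).item())
    print(model.sonify_to_frame(audio).shape)  # torch.Size([4, 12288])
```

In this toy version the coupling comes only from the cycle term; the paper additionally enforces perceptual priors and borrows from unpaired image translation and disentanglement to structure the shared latent space.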